S3 Files and the changing face of S3


Photo credit: Ossewa

Almost everyone at some point in their career has dealt with the deeply frustrating process of moving large amounts of data from one place to another, and if you haven’t, you probably just haven’t worked with large enough datasets yet. For Andy Warfield, one of those formative experiences was at UBC, working alongside genomics researchers who were producing extraordinary volumes of sequencing data but spending an absurd amount of their time on the mechanics of getting that data where it needed to be. Forever copying data back and forth, managing multiple inconsistent copies. It is a problem that has frustrated builders across every industry, from scientists in the lab to engineers training machine learning models, and it is exactly the type of problem that we should be solving for our customers.

In this post, Andy writes about the solution that his team came up with: S3 Files. The hard-won lessons, a few genuinely funny moments, and at least one ill-fated attempt to name a new data type. It is a fascinating read that I think you’ll enjoy.

–W


Part 1: The Changing Face of S3

First, some botany

It turns out that sunflowers are a lot more promiscuous than humans. 

About a decade ago, just before joining Amazon, I had wrapped up my second startup and was back teaching at UBC. I wanted to explore something that I didn’t have a lot of research experience with and decided to learn about genomics, and in particular the intersection of computer systems and how biologists perform genomics research. I wound up spending time with Loren Rieseberg, a botany professor at UBC who studies sunflower DNA—analyzing genomes to understand how plants develop traits that let them thrive in challenging environments like drought or salty soils.

The botanists’ joke about promiscuity (the one that opened this post) was one reason why Loren’s lab was so fun to work with. Their explanation was that human DNA has about 3 billion base pairs, and any two humans are 99.9% identical at a genomic level—all of our DNA is remarkably similar. But sunflowers, being flowers, and not at all monogamous, have both larger genomes (about 3.6 billion base pairs) and far more variation: roughly 10 times more genetic variation between individuals.

One of my PhD grads at the time, JS Legare, decided to join me on this adventure and went on to do a postdoc in Loren’s lab, exploring how we might move these workloads to the cloud. Genomic analysis is an example of what some researchers have called “burst parallel” computing: analyzing DNA can be spread across massive amounts of parallel computation, and when you do that, each task often runs for only a short period of time. This means that using local hardware in a lab can be a poor fit, because you often don’t have enough compute to run fast analysis when you need to, and the compute you do have sits idle when you aren’t doing active work. Our idea was to explore using S3 and serverless compute to run tens or hundreds of thousands of tasks in parallel so that researchers could run complex analysis very, very quickly, and then scale down to zero when they were done.

The biologists worked in Linux with an analytics framework called GATK4—a genomic analysis toolkit with integration for Apache Spark. All of their data lived on a shared NFS filer. In bridging to the cloud, JS built a system he called “bunnies” (another promiscuity joke) to package analyses in containers and run them against data in S3, which was a real win for velocity, repeatability, and performance through parallelization. But a standout lesson was the friction at the storage boundary.

S3 was great for parallelism, cost, and durability, but every tool the genomics researchers used expected a local Linux filesystem. Researchers were forever copying data back and forth, managing multiple, sometimes inconsistent copies. This data friction—S3 on one side, a filesystem on the other, and a manual copy pipeline in between—is something I’ve seen over and over in the years since. In media and entertainment, in pretraining for machine learning, in silicon design, and in scientific computing. Different tools are written to access data in different ways and it sucks when the API that sits in front of our data becomes a source of friction that makes it harder to work with.

Agents amplify data friction

We are all aware of, and I think still maybe even a little stunned by, the way that agentic tooling is changing software development today. Agents are pretty darned good at writing code, and they are getting better at it fast enough that we’re all spending a fair bit of time thinking about what it all even means (even Werner). One thing that does really seem true, though, is that agentic development has profoundly changed the cost of building applications. Cost in terms of dollars, in terms of time, and especially in terms of the skill associated with writing workable code. And it’s this last part that I’ve been finding the most exciting lately, because for about as long as we’ve had software, successful applications have always involved combining two often disjoint skillsets: on one hand, skill in the domain of the application being written, like genomics, or finance, or design, and on the other hand, skill in actually writing code. In a lot of ways, agents are illustrating just how prohibitively high the barrier to entry for writing software has always been, and are suddenly allowing apps to be written by a much larger set of people–people with deep skills in the domains of the applications being written, rather than in the mechanics of writing them.

As we find ourselves in this spot where applications are being written faster, more experimentally, more diversely than ever, the cycle time from idea to running code is compressing dramatically. As the cost of building applications collapses, and as each application we build can serve as a reference for the next one, it really feels like the code/data division is becoming more meaningful than it has ever been before. We are entering a time where applications will come and go, and as always, data outlives all of them. The role of effective storage systems has always been not just to safely store data, but also to help abstract and decouple it from individual applications. As the pace of application development accelerates, this property of storage has become more important than ever, because the easier data is to attach to and work with, the more that we can play, build, and explore new ways to benefit from it.

S3 as a steward for your data

Over the past few years, the S3 team has been really focused on this last point. We’ve been looking closely at situations where the way that data is accessed in S3 just isn’t simple enough–precisely like the example of biologists in Loren’s lab having to build scripts to copy data around so that it’s in the right place to use with their tooling–and we started looking more broadly at places where customers were finding that working with storage was distracting them from working with data. The first lesson that we had here was with structured data. S3 stores exabytes of parquet data and averages over 25 million requests per second to that format alone. A lot of this was either as plain parquet or structured as Hive tables. And it was clear that people wanted to do more with this data. Open table formats, notably Apache Iceberg, were emerging as functionally richer table abstractions allowing insertions and mutations, schema changes, and snapshots of tables. While Iceberg was clearly helping lift the level of abstraction for tabular data on S3, it also still carried a set of sharp edges because it was having to surface tables strictly over the object API.

As Iceberg started to grow in popularity, customers who adopted it at scale told us that managing security policy was difficult, that they didn’t want to have to manage table maintenance and compaction, and that they wanted working with tabular data to be easier. Moreover, a lot of work on Iceberg and Open Table Formats (OTFs) generally was being driven specifically for Spark. While Spark is very important as an analytics engine, people store data in S3 because they want to be able to work with it using any tool they want, even (and especially!) the tools that don’t exist yet. So in 2024, at re:Invent, we launched S3 Tables as a managed, first-class table primitive that can serve as a building block for structured data. S3 Tables stores data in Iceberg, but adds guardrails to protect data integrity and durability. It makes compaction automatic, adds support for cross-region table replication, and continues to refine and extend the idea that a table should be a first-class data primitive that sits alongside objects as a way to build applications. Today we have over 2 million tables stored in S3 Tables and are seeing all sorts of remarkable applications built on top of them.

At around the same time, we were beginning to have a lot of conversations about similarity search and vector indices with S3 customers. AI advances over the past few years have really created both an opportunity and a need for vector indexes over all sorts of stored data. The opportunity is provided by advanced embedding models, which have introduced a step-function change in the ability to provide semantic search. Suddenly, customers with large archival media collections, like historical sports footage, could build a vector index and do a live search for a specific player scoring diving touchdowns and instantly get a collection of clips, assembled as a hit reel, that can be used in live broadcast. That same property of semantically relevant search is equally valuable for RAG and for applying models over data they weren’t trained on.

As customers started to build and operate vector indexes over their data, they began to highlight a slightly different source of data friction. Powerful vector databases already existed, and vectors had been quickly working their way in as a feature on existing databases like Postgres. But these systems stored indexes in memory or on SSD, running as compute clusters with live indices. That’s the right model for a continuous low-latency search facility, but it’s less helpful if you’re coming to your data from a storage perspective. Customers were finding that, especially over text-based data like code or PDFs, the vectors themselves were often more bytes than the data being indexed, stored on media many times more expensive.

So just like with the team’s work on structured data with S3 Tables, at the last re:Invent we launched S3 Vectors as a new S3-native data type for vector indices. S3 Vectors takes a very S3 spin on storing vectors in that its design anchors on a performance, cost, and durability profile that is very similar to S3 objects. Probably most importantly though, S3 Vectors is designed to be fully elastic, meaning that you can quickly create an index with only a few hundred records in it, and scale over time to billions of records. S3 Vectors’ biggest strength is really the sheer simplicity of having an always-available API endpoint that can support similarity search indices. Just like objects and tables, it’s another data primitive that you can just reach for as part of application development.

And now… S3 Files

Today, we are launching S3 Files, a new S3 feature that integrates the Amazon Elastic File System (EFS) into S3 and allows any existing S3 data to be accessed directly as a network attached file system.

The story about files is actually longer, and even more interesting than the work on either Tables or Vectors, because files turn out to be a complex and tricky data type to cleanly integrate with object storage. We actually started working on the files idea before we launched S3 Tables, as a joint effort between the EFS and S3 teams, but let’s put a pin in that for a second.

As I described with the genomics example of analyzing sunflower DNA, there is an enormous body of existing software that works with data through filesystem APIs: data science tools, build systems, log processors, configuration management, and training pipelines. If you have watched agentic coding tools work with data, they are very quick to reach for the rich range of Unix tools to work directly with data in the local file system. Working with data in S3 adds a layer of reasoning for them: they have to actively list objects in S3, transfer them to the local disk, and then operate on those local copies. And it’s obviously broader than just the agentic use case; it’s true for every customer application that works with local file systems today. Natively supporting files on S3 makes all of that data immediately more accessible—and ultimately more valuable. You don’t have to copy data out of S3 to use pandas on it, or to point a training job at it, or to interact with it using a design tool.

With S3 Files, you get a really simple thing. You can now mount any S3 bucket or prefix inside your EC2 VM, container, or Lambda function and access that data through your file system. If you make changes, your changes will be propagated back to S3. As a result, you can work with your objects as files, and your files as objects.

And this is where the story gets interesting, because as we often learn when we try to make things simple for customers, making something simple is often one of the more complicated things that you can set out to do.

Part 2: The Design of S3 Files

Builders hate the fact that they have to decide early on whether their data is going to live in a file system or an object store, and to be stuck with the consequences of that from then on. With that decision, they are basically picking how they are going to interact with their data not just now, but long into the future, and if they get it wrong they either have to do a migration or build a layer of automation for copying data.

Early on, the idea was basically that we would just put EFS and S3 in a giant pot, simmer it for a bit, and we would get the best of both worlds. We even called the early version of the project “EFS3” (and I’m glad we didn’t keep that name!). But things got tricky in a hurry. Every time we sat down to work through designs, we found difficult technical challenges and tough decisions. And in each of these decisions, either the file or the object presentation of data would have to give something up in the design that would make it a bit less good. One of the engineers on the team described this as “a battle of unpalatable compromises.”  We were hardly the first storage people to discover how difficult it is to converge file and object into a single storage system, but we were also acutely aware of how much not having a solution to the problem was frustrating builders.

We were determined to find a path through it so we did the only sensible thing you can do when you are faced with a really difficult technical design problem: we locked a bunch of our most senior engineers in a room and said we weren’t going to let them out till they had a plan that they all liked.

Passionate and contentious discussions ensued. And ensued. And ensued. And eventually we gave up. We just couldn’t get to a solution that didn’t leave someone (and in most cases really everyone) unhappy with the design.

A quick aside at this point: I may be taking some dramatic liberties with the comment about locking people in a room. The Amazon meeting rooms don’t have locks on them. But to be clear on this point: I frequently find that we make the fastest and most constructive progress on really hard design problems when we get smart, passionate people with differing technical views in front of a whiteboard to really dig in over a period of days. This isn’t an earth-moving observation, but it’s often surprising how easy it can be to forget in the face of trying to talk through big hard problems in one-hour blocks over video conference. The engineers in these discussions deeply understood file and object workloads and the subtleties of how different they can be, and so these discussions were deep, sometimes heated, and absolutely fascinating. And despite all of this, we still couldn’t get to a design that we liked. It was really frustrating.

This was around Christmas of 2024. Leading into the holidays, the team changed course. They went through the design docs and discussion notes that they had and started to enumerate all of the specific design compromises and the behaviour that we would need to be comfortable with if we wanted to present both file and object interfaces as a single unified system. We all looked at it and agreed that it wasn’t the best of both worlds, it was the lowest common denominator, and we could all think of example workloads on both sides that would break in surprising, often subtle, and always frustrating ways.

I think the example where this really stood out to me was around the top-level semantics and experience of how objects and files are actually different as data primitives. Here’s a painfully simple characterization: files are an operating system construct. They exist on storage, and persist when the power is out, but when they are used they are incredibly rich as a way of representing data, to the point that they are very frequently used as a way of communicating across threads, processes, and applications. Application APIs for files are built to support the idea that I can update a record in a database in place, or append data to a log, and that you can concurrently access that file and see my change almost instantaneously, in an arbitrary sub-region of the file. There’s a rich set of OS functionality, like mmap(), that doubles down on files as shared persistent data that can mutate at a very fine granularity, as if it were a set of in-memory data structures.
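To make that concrete, here is a minimal sketch of the fine-grained mutation model that files support: a writer maps a file with mmap() and updates a small sub-region in place, and a separate open of the same path sees the change immediately, without the file ever being rewritten as a whole.

```python
import mmap
import os
import tempfile

# Create a small file with three fixed-width records.
path = os.path.join(tempfile.mkdtemp(), "shared.dat")
with open(path, "wb") as f:
    f.write(b"record-000|record-001|record-002")

# Writer: map the file and mutate one record in place, at byte granularity.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mm:
        mm[7:10] = b"999"   # fine-grained, sub-region mutation
        mm.flush()

# Reader: a separate open of the same file observes the mutation.
with open(path, "rb") as f:
    print(f.read())  # b'record-999|record-001|record-002'
```

Nothing like this exists in the object API: there is no way to overwrite three bytes in the middle of an object while another client reads it.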

Now if we flip over to object world, the idea of writing to the middle of an object while someone else is accessing it is more or less sacrilege. The immutability of objects is an assumption that is cooked into APIs and applications. Tools will download and verify content hashes, they will use object versioning to preserve old copies. Most notable of all, they often build sophisticated and complex workflows that are entirely anchored on the notifications that are associated with whole object creation. This last thing was something that surprised me when I started working on S3, and it’s actually really cool. Systems like S3 Cross Region Replication (CRR) replicate data based on notifications that happen when objects are created or overwritten and those notifications are counted on to have at-least-once semantics in order to ensure that we never miss replication for an object. Customers use similar pipelines to trigger log processing, image transcoding and all sorts of other stuff–it’s a very popular pattern for application design over objects. In fact, notifications are an example of an S3 subsystem that makes me marvel at the scale of the storage system I get to work on: S3 sends over 300 billion event notifications every day just to serverless event listeners that process new objects!

The thing that we came to realize was that there is actually a pretty profound boundary between files and objects. File interactions are agile, often mutation heavy, and semantically rich. Objects, on the other hand, come with a relatively focused and narrow set of semantics. We realized that this boundary separating them was what we really needed to pay attention to, and that rather than trying to hide it, the boundary itself was the feature we needed to build.

Stage and Commit

When we got back from the holidays, we started locking (well, ok, not exactly locking) folks in rooms again, but this time with the view that the boundary between file and object didn’t actually have to be invisible. And this time, the team started coming out of discussions looking a lot happier.

The first decision was that we were going to treat first-class file access on S3 as a presentation layer for working with data. We would allow customers to define an S3 mount on a bucket or prefix, and under the covers, that mount would attach an EFS namespace to mirror the metadata from S3. We would make the transit and consistency of data across the two layers an absolutely central part of our design. We started to describe this as “stage and commit,” a term that we borrowed from version control systems like git—changes would be able to accumulate in EFS, and then be pushed down collectively to S3—and the specifics of how and when data transited the boundary would be published as part of the system, clear to customers, and something that we could actually continue to evolve and improve as a programmatic primitive over time. (I’m going to talk about this point a little more at the end, because there’s much more the team is excited to do on this surface.)
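A toy model helps make the stage-and-commit flow concrete. Everything in this sketch is illustrative (the class and method names are invented for this post, not the S3 Files API): fine-grained file writes accumulate in a staging layer, and a commit pushes each changed file down to the object store as one whole-object write.

```python
# Illustrative sketch only: a toy stage-and-commit layer, not the S3 Files API.
class StagedMount:
    def __init__(self, object_store):
        self.object_store = object_store   # a dict standing in for a bucket
        self.staged = {}                   # path -> latest file contents

    def write_file(self, path, data):
        # Fine-grained file activity stays in the staging layer.
        self.staged[path] = data

    def commit(self):
        # Each changed file crosses the boundary as one whole-object write.
        for path, data in self.staged.items():
            self.object_store[path] = data
        committed = list(self.staged)
        self.staged.clear()
        return committed

s3 = {}  # stand-in for a bucket
mount = StagedMount(s3)
mount.write_file("logs/run-1.txt", b"partial")
mount.write_file("logs/run-1.txt", b"partial, then appended")  # many edits...
print(mount.commit())            # ['logs/run-1.txt'] -- one PUT, many writes
print(s3["logs/run-1.txt"])      # b'partial, then appended'
```

The important property, which the real system preserves, is that the object side never observes a half-finished file: however many fine-grained mutations happen on the file side, the boundary is crossed in whole-object units.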

Being explicit about the boundary between file and object presentations is something that I did not expect at all when the team started working on S3 Files, and it’s something that I’ve really come to love about the design. It is early and there is plenty of room for us to evolve, but I think the team all feels that it sets us up on a path where we are excited to improve and evolve in partnership with what builders need, and not be stuck behind those unpalatable compromises. 

Not out of the woods

Deciding on this stage and commit thing was one of those design decisions that provided some boundaries and separation of concerns. It gave us a clear structure, but it didn’t make the hard problems go away. The team still had to navigate real tradeoffs between file and object semantics, performance, and consistency. Let me walk through a few examples to show how nuanced these two abstractions really are, and how the team approached these decisions.

Consistency and atomicity

S3 readers often assume full object updates, notifications, and in many cases access to historical versions. File systems have fine-grained mutations, but they have important consistency and atomicity tricks as well. Many applications depend on the ability to do atomic file renames as a way of making a large change visible all at once. They do the same thing with directory moves. S3 conditionals help a bit with the first thing but aren’t an exact match, and there isn’t an S3 analog for the second. So as mentioned above, separating the layers allows these modalities to coexist in parallel systems with a single view of the same data. You can mutate and rename a file all you want, and at a later point, it will be written as a whole to S3.
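The atomic-rename pattern mentioned above is worth seeing in miniature: a file application writes a complete new version to a temporary name, then renames it over the target, so concurrent readers see either the old contents or the new ones, never a partial write. POSIX rename is atomic; S3 has no equivalent primitive, which is exactly why this stays a file-layer behaviour.

```python
import os
import tempfile

# Existing file that readers may have open at any moment.
target = os.path.join(tempfile.mkdtemp(), "config.json")
with open(target, "w") as f:
    f.write('{"version": 1}')

# Write the full new version somewhere else first...
tmp = target + ".tmp"
with open(tmp, "w") as f:
    f.write('{"version": 2}')

# ...then atomically swap it into place: the change is visible all at once.
os.replace(tmp, target)

with open(target) as f:
    print(f.read())  # {"version": 2}
```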

Authorization

Authorization is equally thorny. S3 and file systems think about authorization in very different ways. S3 supports IAM policies scoped to key prefixes—you can say “deny GetObject on anything under /private/”. In fact, you can further constrain those permissions based on things like the network or properties of the request itself. IAM policies are incredibly rich, and also much more expensive to evaluate than file permissions are. File systems have spent years getting things like permission checks off of the data path, often evaluating up front and then using a handle for persistent future access. Files are also a little weird as an entity to wrap authorization policy around, because permissions for a file live in its inode. Hard links allow many names for the same inode, and you also need to think about directory permissions that determine whether you can get to a file in the first place. Unless you have a handle on it, in which case it kind of doesn’t matter, even if it’s renamed, moved, and often even deleted.
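That last point is easy to demonstrate on any POSIX system: once a process holds an open handle, the permission check has already happened, and the file stays readable even after every name for it has been renamed away and unlinked.

```python
import os
import tempfile

d = tempfile.mkdtemp()
path = os.path.join(d, "secret.txt")
with open(path, "w") as f:
    f.write("still reachable")

handle = open(path)                      # permission check happens here
os.rename(path, os.path.join(d, "moved.txt"))
os.remove(os.path.join(d, "moved.txt"))  # no name left in the namespace

data = handle.read()                     # the handle outlives every name
print(data)  # still reachable
handle.close()
```

(This is POSIX behaviour; on Windows the remove of an open file would fail instead.) There is simply no analog for this in a prefix-scoped IAM model, which is part of why co-representing the two permission systems on every object was a dead end.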

There’s a lot more complexity, erm, richness to discuss here—especially around topics like user and group identity—but by moving to an explicit boundary, the team got themselves out of having to co-represent both types of permissions on every single object. Instead, permissions could be specified on the mount itself (familiar territory for network file system users) and enforced within the file system, with specific mappings applied across the two worlds.

This design had another advantage. It preserved IAM policy on S3 as a backstop. You can always disable access at the S3 layer if you need to change a data perimeter, while delegating authorization up to the file layer within each mount. And it left the door open for situations in the future where we might want to explore multiple different mounts over the same data.

The dreadful incongruity of namespace semantics

If you are familiar with both file and object systems, it’s not a hard exercise to think about cases where file and object naming behaves quite differently. When you start to sit down and really dig into it, things get almost hilariously desolate. File systems have first-class path separators—often forward slash ("/") characters. S3 has these too, but they are really just a suggestion. In fact, S3’s LIST command allows you to specify anything you want to be parsed as a path separator and there are a handful of customers who have built remarkable multi-dimensional naming structures that embed multiple different separators in the same paths and pass a different delimiter to LIST depending on how they want to organize results.
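A pure-Python sketch shows why delimiters-as-suggestion is so flexible (and so awkward to map onto a file tree). The helper below mimics the common-prefix grouping that S3’s ListObjectsV2 performs for whatever delimiter the caller supplies; no actual S3 calls are made, and the keys are invented for illustration.

```python
# Group a flat set of keys by the caller-chosen delimiter, the way S3's LIST
# reports CommonPrefixes. The delimiter is a per-request choice, not a
# property of the namespace.
def list_common_prefixes(keys, delimiter):
    prefixes = set()
    for key in keys:
        i = key.find(delimiter)
        if i != -1:
            prefixes.add(key[: i + 1])
    return sorted(prefixes)

# Keys that embed two "dimensions" using two different separators.
keys = [
    "us-east#2024/sales.csv",
    "us-east#2025/sales.csv",
    "eu-west#2024/sales.csv",
]

print(list_common_prefixes(keys, "#"))  # group by region
print(list_common_prefixes(keys, "/"))  # group by "directory"
```

The same three keys produce two entirely different hierarchies depending on the delimiter, whereas a file system bakes exactly one separator into its namespace forever.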

Here’s another simple and annoying one: because S3 doesn’t have directories, you can have objects whose keys end with that same slash. That is to say, you can have a thing that looks like a directory but is actually an object. For about 20 minutes the team thought this was a cool feature and were calling them “filerectories.” Thank goodness we didn’t keep that one.

There are tens of these differences, and we carefully thought about restricting to a single common structure or just fixing ourselves on one side or the other. On all of these paths we realized that we were going to break assumptions about naming inside applications.

We decided to lean into the boundary and allow both sides to stick with their existing naming conventions and semantics. When objects or files are created that can’t be moved across the boundary, we decided that (and wow was this ever a lot of passionate discussion) we just wouldn’t move them. Instead, we would emit an event to allow customers to monitor and take action if necessary. This is clearly an example of offloading complexity onto the developer, but I think it’s also a profoundly good example of that being the right thing to do, because we are choosing not to fail things in the domains where they already expect to run, we are building a boundary that admits the vast majority of path names that actually do work in both cases, and we are building a mechanism to detect and correct problems as they arise.

The experience of performance

The last big area of differences that the team spent a lot of time talking about was performance, and in particular the performance and request latency of namespace interactions. File and object namespaces are optimized for very different things. In a file system, there are a lot of data-dependent accesses to metadata. Accessing a file means also accessing (and in some cases updating) the directory record. There are also many operations that end up traversing all of the directory records along a path. As a result, fast file system namespaces—even big distributed ones—tend to co-locate all the metadata for a directory on a single host so that those interactions are as fast as possible. The object namespace is completely flat and tends to optimize for very highly parallel point queries and updates. There are many cases in S3 where individual “directories” have billions of objects in them and are being accessed by hundreds of thousands of clients in parallel.

As we looked through the set of challenges that I’ve just described, we spent a lot of time talking about adoption. S3 is two decades old and we wanted a solution that existing S3 customers could immediately use on their own data, and not one that meant migrating to something completely new. There are enormous numbers of existing buckets serving applications that depend on S3’s object semantics working exactly as documented. We were not willing to introduce subtle new behaviours that could break those applications.

It turns out that very few applications use both file and object interfaces concurrently on the same data at the same instant. The far more common pattern is multiphase. A data processing pipeline uses filesystem tools in one stage to produce output that’s consumed by object-based applications in the next. Or a customer wants to run analytics queries over a snapshot of data that’s actively being modified through a filesystem.

We realized that it’s not necessary to converge file and object semantics to solve the data silo problem. What customers needed was the same data in one place, with the right view for each access pattern. A file view that provides full NFS close-to-open consistency. An object view that provides full S3 atomic-PUT strong consistency. And a synchronization layer that keeps them connected.

So we shipped it

All of that arguing—the team’s list of “unpalatable compromises”, the passionate and occasionally desolate discussions about filerectories—turned out to be exactly the work we needed to do. I think the team all feels that the design is better for having gone through it. S3 Files lets you mount any S3 bucket or prefix as a filesystem on your EC2 instance, container, or Lambda function. Behind the scenes it’s backed by EFS, which provides the file experience your tools already expect. NFS semantics, directory operations, permissions. From your application’s perspective, it’s a mounted directory. From S3’s perspective, the data is objects in a bucket.

The way it works is worth a quick walk through. When you first access a directory, S3 Files imports metadata from S3 and populates a synchronized view. For files under 128 KB it also pulls the data itself. For larger files only metadata comes over and the data is fetched from S3 when you actually read it. This lazy hydration is important because it means that you can mount a bucket with millions of objects in it and just start working immediately. This “start working immediately” part is a good example of a simple experience that is actually pretty sophisticated under the covers–being able to mount and immediately work with objects in S3 as files is an obvious and natural expectation for the feature, and it would be pretty frustrating to have to wait minutes or hours for the file view of metadata to be populated. But under the covers, S3 Files needs to scan S3 metadata and populate a file-optimized namespace for it, and the team was able to make this happen very quickly, and as a background operation that preserves a simple and very agile customer experience.

When you create or modify files, changes are aggregated and committed back to S3 roughly every 60 seconds as a single PUT. Sync runs in both directions, so when other applications modify objects in the bucket, S3 Files automatically spots those modifications and reflects them in the filesystem view automatically. If there is ever a conflict where files are modified from both places at the same time, S3 is the source of truth and the filesystem version moves to a lost+found directory with a CloudWatch metric identifying the event. File data that hasn’t been accessed in 30 days is evicted from the filesystem view but not deleted from S3, so storage costs stay proportional to your active working set.

There are many smaller, and really fun, bits of work that happened as the team built the system. One of the improvements that I think is really cool is what we are calling “read bypass.” For high-throughput sequential reads, read bypass automatically reroutes the read data path away from traditional NFS access and instead performs parallel GET requests directly against S3 itself. This approach achieves 3 GB/s per client (with further room to improve) and scales to terabits per second across multiple clients. And for those who are interested, there’s way more detail in our technical docs (which are a pretty interesting read).
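The core idea behind read bypass is a familiar one: split a large sequential read into byte ranges and fetch them concurrently, then stitch the results back in order. The sketch below illustrates the pattern against a local buffer; in the real system the equivalent of fetch_range() would be a ranged GET against S3, and the names here are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_range(blob, start, end):
    # Stand-in for a ranged GET of bytes [start, end) of an object.
    return blob[start:end]

def parallel_read(blob, chunk_size=4, workers=8):
    # Carve the read into fixed-size ranges...
    ranges = [(i, min(i + chunk_size, len(blob)))
              for i in range(0, len(blob), chunk_size)]
    # ...fetch them concurrently...
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: fetch_range(blob, *r), ranges)
    # ...and reassemble in order (map() preserves input order).
    return b"".join(parts)

data = bytes(range(32))
assert parallel_read(data) == data
print(len(parallel_read(data)))  # 32
```

The throughput win comes from the fan-out: each range is an independent request, so aggregate bandwidth scales with the number of in-flight fetches rather than being bounded by a single stream.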

One thing I’ve really come to appreciate about the design is how honest it is about its own edges. The explicit boundary between file and object domains isn’t a limitation we’re papering over. It’s the thing that lets both sides remain uncompromised. That said, there are places where we know we still have work to do. Renames are expensive because S3 has no native rename operation, so renaming a directory means copying and deleting every object under that prefix. We warn you when a mount covers more than 50 million objects for exactly this reason. Explicit commit control isn’t there at launch; the 60-second window works for most workloads but we know it won’t be enough for everyone. And there are object keys that simply can’t be represented as valid POSIX filenames, so they won’t appear in the filesystem view. We’ve been in customer beta for about nine months and these are the things that we’ve learned and continued to evolve and iterate on with early customers. We’d rather be clear about them than pretend they don’t exist.

Files and Sunflowers

When we were working with Loren’s lab at UBC, JS spent a remarkable amount of his time building caching and naming layers – not doing biology, but writing infrastructure to shuttle data between where it lived and where tools expected it to be. That friction really stood out to me, and looking back at it now, I think the lesson we kept learning – in that lab, and then over and over again as the S3 team worked on Tables, Vectors, and now Files – is that different ways of working with data aren’t a problem to be collapsed. They’re a reality to be served. The sunflowers in Loren’s lab thrived on variation, and it turns out data access patterns do too.

What I find most exciting about S3 Files is something I genuinely did not expect when we started: that the explicit boundary between file and object turned out to be the best part of the design. We spent months trying to make it disappear, and when we finally accepted it as a first-class element of the system, everything got better. Stage and commit gives us a surface that we can continue to evolve—more control over when and how data transits the boundary, richer integration with pipelines and workflows—and it sets us up to do that without compromising either side.

Twenty years ago, S3 started as an object store. Over the past couple of years, with Tables, Vectors, and now Files, it’s become something broader. A place where data lives durably and can be worked with in whatever way makes sense for the job at hand. Our goal is for the storage system to get out of the way of your work, not to be a thing that you have to work around. We’re nowhere near done, but I’m really excited about the direction that we’re heading in.

As Werner says, “Now, go build!”