Expanding the Cloud - Amazon S3 Reduced Redundancy Storage


Today a new storage option for Amazon S3 has been launched: Amazon S3 Reduced Redundancy Storage (RRS). This new storage option enables customers to reduce their costs by storing non-critical, reproducible data at lower levels of redundancy. This has been an option that customers have been asking us about for some time so we are really pleased to be able to offer this alternative storage option now.

Durability in Amazon S3

Amazon Simple Storage Service (S3) was launched in 2006 as "Storage for the Internet" with the promise to make web-scale computing easier for developers. Four years later it stores over 100 billion objects and routinely performs well over 120,000 storage operations per second. Its use cases span the whole storage spectrum: from enterprise database backup to media distribution, from consumer file storage to genomics datasets, from image serving for websites to telemetry storage for NASA.

Core to its success has been its simplicity: no matter how many objects you want to store, how small or big those objects are, or what object access patterns you have to deal with, you can rely on Amazon S3 to take away the headaches of dealing with the hard issues typically associated with big storage systems. All the complexities of scalability, reliability, durability, performance and cost-effectiveness are hidden behind a very simple programming interface.

Under the covers Amazon S3 is a marvel of distributed systems technologies. It is the ultimate incrementally scalable system; simply by adding resources it can handle scaling needs in storage and performance dimensions. It also needs to handle every possible failure of storage devices, of servers, networks and operating systems, all while continuing to serve hundreds of thousands of customers.

The goal for operational excellence in Amazon S3 (and for all other AWS services) is that it should be "indistinguishable from perfect". While individual components may be failing all the time, as is normal in any large-scale system, the customers of the service should be completely shielded from this. For example, to ensure availability of data, the data is replicated over multiple locations such that failure modes are independent of each other. The locations are chosen with great care to achieve this independence, and if one or more of those locations becomes unreachable, S3 can continue to serve its customers. Some of the biggest innovations inside Amazon S3 have been how to use software techniques to mask many of the issues that would easily have paralyzed every other storage system.

The same goes for durability; core to the design of S3 is that we go to great lengths to never, ever lose a single bit. We use several techniques to ensure the durability of the data our customers trust us with, and some of those (e.g. replication across multiple devices and facilities) overlap with those we use for providing high-availability. One of the things that S3 is really good at is deciding what action to take when failure happens, how to re-replicate and re-distribute such that we can continue to provide the availability and durability the customers of the service have come to expect. These techniques allow us to design our service for 99.999999999% durability.

Relaxing Durability

There are many innovative techniques we deploy to provide this durability and a number of them are related to the redundant storage of data. As such, the cost of providing such high durability is an important component of the storage pricing of S3. While this high durability is exactly what most customers want, there are some usage scenarios where customers have asked us if we could relax the durability in exchange for a reduction in cost. In these scenarios, the customers are able to reproduce the object if it were ever lost, either because they are storing another authoritative copy or because they can reconstruct the object from other sources.

We can now offer these customers the option to use Amazon S3 Reduced Redundancy Storage (RRS), which provides 99.99% durability at significantly lower cost. This durability is still much better than that of a typical storage system as we still use some forms of replication and other techniques to maintain a level of redundancy. Amazon S3 is designed to sustain the concurrent loss of data in two facilities, while the RRS storage option is designed to sustain the loss of data in a single facility. Because RRS is redundant across facilities, it is highly available and backed by the Amazon S3 Service Level Agreement.
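To make the two durability figures concrete, here is a small sketch of the expected annual loss they imply. It assumes the percentages are per-object, per-year figures and that the expected loss is simply the object count times the loss probability; the helper name and the example object counts are illustrative, not taken from the S3 documentation.

```python
from fractions import Fraction

# Annual per-object loss probabilities implied by the design targets above.
# Exact fractions avoid floating-point rounding on the eleven-nines figure.
STANDARD_LOSS = Fraction(1, 10**11)  # 1 - 99.999999999%
RRS_LOSS = Fraction(1, 10**4)        # 1 - 99.99%


def expected_annual_loss(num_objects, loss_probability):
    """Expected number of objects lost per year (hypothetical helper)."""
    return num_objects * loss_probability


if __name__ == "__main__":
    for n in (10_000, 100_000_000_000):
        standard = float(expected_annual_loss(n, STANDARD_LOSS))
        rrs = float(expected_annual_loss(n, RRS_LOSS))
        print(f"{n} objects: standard={standard}, RRS={rrs}")
```

Under these assumptions, 10,000 objects stored in RRS lose about one object per year on average, while standard storage would need around 100 billion objects to reach the same expected loss.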

More details on the new option and its pricing advantages can be found on the Amazon S3 product page. Further insights into the use of this option can be found on the AWS developer blog.

6 Comments

If S3 is storing 100 billion objects with 99.999999999% durability over a year, the expected rate of data loss is 1 object per year. However, I imagine that objects are not independent here, so the number of objects lost each year would not follow a Poisson distribution.

Can you estimate the probability -- hopefully rather more than 1/e -- of having zero objects lost in a year?
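A rough sketch of the arithmetic behind this question, assuming (as the comment itself doubts) that each object is lost independently with an annual probability of 1e-11. The function names are illustrative.

```python
import math


def prob_zero_loss_binomial(num_objects, loss_probability):
    """P(no object lost in a year) if each object is lost independently."""
    # log1p keeps precision when loss_probability is tiny.
    return math.exp(num_objects * math.log1p(-loss_probability))


def prob_zero_loss_poisson(num_objects, loss_probability):
    """Poisson approximation with rate = expected number of losses."""
    return math.exp(-num_objects * loss_probability)


if __name__ == "__main__":
    n, p = 100_000_000_000, 1e-11
    print("binomial:", prob_zero_loss_binomial(n, p))
    print("poisson: ", prob_zero_loss_poisson(n, p))
    print("1/e:     ", 1 / math.e)
```

Under independence both values come out at roughly 1/e. If losses are correlated, for example many objects sharing one failed device, the same expected loss would tend to be concentrated in fewer events, which would make a loss-free year more likely than 1/e.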

I've developed a backup application, HashBackup, that uses S3 as one of its backup targets, as well as SSH, rsync, ftp, imap (email), and mounted storage (nfs).

In its default configuration, HashBackup maintains a local copy of the compressed backup archives, on the theory that most users have plenty of local disk space. The local copy also comes in handy since it avoids the need to download backup data to perform file retention/pruning.

So for HashBackup, it would be useful to have an S3 redundancy storage class of None, with dirt cheap pricing. I'm assuming here that behind the scenes, S3 tries to avoid storing all of one bucket's objects on a single disk drive, so that if an S3 drive failure occurs, it may affect a large number of objects, but the objects would be distributed over a large number of buckets, i.e., you wouldn't lose all the objects in a single bucket.

Another thing that would be useful is an API call to fetch a list of objects missing from a bucket. For HashBackup, it's not that important: backup data from many files is packed into large-ish archive files, so there aren't that many distinct S3 objects, so it's easy to periodically check them all. But for something like an image archive, where each image is its own object, it would be impractical to fetch each object every day just to see if it had been lost and needed to be refreshed from local storage. Much easier to issue one call and get a list of missing objects to refresh.
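Absent such an API call, one workaround is to compare a local manifest against a bucket listing rather than fetching each object. The sketch below shows only the comparison step; `expected_keys` and `listed_keys` are hypothetical inputs, and obtaining the listing from S3 is left out. Whether a lost object would actually drop out of a listing is an assumption here, not something the post states.

```python
def find_missing_keys(expected_keys, listed_keys):
    """Return keys from the local manifest that the bucket listing lacks."""
    listed = set(listed_keys)
    # Preserve manifest order so re-uploads happen in a predictable sequence.
    return [key for key in expected_keys if key not in listed]


if __name__ == "__main__":
    manifest = ["img/001.jpg", "img/002.jpg", "img/003.jpg"]  # local record
    listing = ["img/001.jpg", "img/003.jpg"]                  # from the bucket
    print(find_missing_keys(manifest, listing))
```

The set lookup keeps the comparison linear in the number of keys, so a large image archive can be checked from a single listing pass.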

For apps like image serving, another useful feature would be a missing object redirect URL. If I have a large image store, say for an e-commerce site, it could all be uploaded to S3 with RRS of None and a metadata tag with a missing object redirect URL to my local image server. If S3 ever completely lost an image, instead of returning a 404 Not Found error to my customer, S3 could return a temporary redirect back to my image server. This would let me serve the image from local storage without my customer seeing a failure, and also serve as notification to check whether the object needs to be uploaded again to S3. If the missing object is deleted from S3, the temporary redirect would of course stop. As I'm thinking about it, I guess there would also have to be some kind of expiration time, so that if an S3 object goes missing, any record of it would eventually be removed.

Thanks for providing S3. It lets small companies compete with huge companies!

Jim

Anonymous said:

The probability of, say, the United States government failing, or a biological catastrophe that leaves 99.99% of people in the world dead, is higher than one-in-100-billion-years. I have no doubt that durability in S3 is excellent, but claiming 11-9's seems pretty ridiculous.

Michael said:

This is a huge development for S3. Previously, it has been an excellent option for burst storage and for critical data retention, but now it is also becoming a viable option (from a cost perspective) for transient and non-critical storage (like thumbnails, publishing, and content management).

frank Domoney said:

Is there a standard definition of durability?

I am familiar with availability and survivability but not durability as a metric.

Kind regards

Frank

frank Domoney said:

Apologies for my previous query.

The definition of durability is on the AWS page.

Kind regards

Frank

About this Entry

This page contains a single entry by Werner Vogels published on May 18, 2010 11:00 PM.
