Eventually Consistent - Revisited
I wrote a first version of this posting on consistency models about a year ago, but I was never happy with it as it was written in haste and the topic is important enough to receive a more thorough treatment. ACM Queue asked me to revise it for use in their magazine and I took the opportunity to improve the article. This is that new version.
Eventually Consistent - Building reliable distributed systems at a worldwide scale demands trade-offs between consistency and availability.
At the foundation of Amazon's cloud computing are infrastructure services such as Amazon's S3 (Simple Storage Service), SimpleDB, and EC2 (Elastic Compute Cloud) that provide the resources for constructing Internet-scale computing platforms and a great variety of applications. The requirements placed on these infrastructure services are very strict; they need to score high marks in the areas of security, scalability, availability, performance, and cost effectiveness, and they need to meet these requirements while serving millions of customers around the globe, continuously.
Under the covers these services are massive distributed systems that operate on a worldwide scale. This scale creates additional challenges, because when a system processes trillions and trillions of requests, events that normally have a low probability of occurrence are now guaranteed to happen and need to be accounted for up front in the design and architecture of the system. Given the worldwide scope of these systems, we use replication techniques ubiquitously to guarantee consistent performance and high availability. Although replication brings us closer to our goals, it cannot achieve them in a perfectly transparent manner; under a number of conditions the customers of these services will be confronted with the consequences of using replication techniques inside the services.
One of the ways in which this manifests itself is in the type of data consistency that is provided, particularly when the underlying distributed system provides an eventual consistency model for data replication. When designing these large-scale systems at Amazon, we use a set of guiding principles and abstractions related to large-scale data replication and focus on the trade-offs between high availability and data consistency. In this article I present some of the relevant background that has informed our approach to delivering reliable distributed systems that need to operate on a global scale. An earlier version of this text appeared as a posting on the All Things Distributed weblog in December 2007 and was greatly improved with the help of its readers.
Historical Perspective
In an ideal world there would be only one consistency model: when an update is made all observers would see that update. The first time this surfaced as difficult to achieve was in the database systems of the late '70s. The best "period piece" on this topic is "Notes on Distributed Databases" by Bruce Lindsay et al. 5 It lays out the fundamental principles for database replication and discusses a number of techniques that deal with achieving consistency. Many of these techniques try to achieve distribution transparency—that is, to the user of the system it appears as if there is only one system instead of a number of collaborating systems. Many systems during this time took the approach that it was better to fail the complete system than to break this transparency.2
In the mid-'90s, with the rise of larger Internet systems, these practices were revisited. At that time people began to consider the idea that availability was perhaps the most important property of these systems, but they were struggling with what it should be traded off against. Eric Brewer, systems professor at the University of California, Berkeley, and at that time head of Inktomi, brought the different trade-offs together in a keynote address to the PODC (Principles of Distributed Computing) conference in 2000.1 He presented the CAP theorem, which states that of three properties of shared-data systems—data consistency, system availability, and tolerance to network partition—only two can be achieved at any given time. A more formal confirmation can be found in a 2002 paper by Seth Gilbert and Nancy Lynch.4
A system that is not tolerant to network partitions can achieve data consistency and availability, and often does so by using transaction protocols. To make this work, client and storage systems must be part of the same environment; they fail as a whole under certain scenarios, and as such, clients cannot observe partitions. An important observation is that in larger distributed-scale systems, network partitions are a given; therefore, consistency and availability cannot be achieved at the same time. This means that there are two choices on what to drop: relaxing consistency will allow the system to remain highly available under the partitionable conditions, whereas making consistency a priority means that under certain conditions the system will not be available.
Both options require the client developer to be aware of what the system is offering. If the system emphasizes consistency, the developer has to deal with the fact that the system may not be available to take, for example, a write. If this write fails because of system unavailability, then the developer will have to deal with what to do with the data to be written. If the system emphasizes availability, it may always accept the write, but under certain conditions a read will not reflect the result of a recently completed write. The developer then has to decide whether the client requires access to the absolute latest update all the time. There is a range of applications that can handle slightly stale data, and they are served well under this model.
In principle the consistency property of transaction systems as defined in the ACID properties (atomicity, consistency, isolation, durability) is a different kind of consistency guarantee. In ACID, consistency relates to the guarantee that when a transaction is finished the database is in a consistent state; for example, when transferring money from one account to another the total amount held in both accounts should not change. In ACID-based systems, this kind of consistency is often the responsibility of the developer writing the transaction but can be assisted by the database managing integrity constraints.
Consistency—Client and Server
There are two ways of looking at consistency. One is from the developer/client point of view: how they observe data updates. The second way is from the server side: how updates flow through the system and what guarantees systems can give with respect to updates.
Client-side Consistency
The client side has these components:
- A storage system. For the moment we'll treat it as a black box, but one should assume that under the covers it is something of large scale and highly distributed, and that it is built to guarantee durability and availability.
- Process A. This is a process that writes to and reads from the storage system.
- Processes B and C. These two processes are independent of process A and write to and read from the storage system. It is irrelevant whether these are really processes or threads within the same process; what is important is that they are independent and need to communicate to share information.
Client-side consistency has to do with how and when observers (in this case the processes A, B, or C) see updates made to a data object in the storage systems. In the following examples illustrating the different types of consistency, process A has made an update to a data object: - Strong consistency. After the update completes, any subsequent access (by A, B, or C) will return the updated value.
- Weak consistency. The system does not guarantee that subsequent accesses will return the updated value. A number of conditions need to be met before the value will be returned. The period between the update and the moment when it is guaranteed that any observer will always see the updated value is dubbed the inconsistency window.
- Eventual consistency. This is a specific form of weak consistency; the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value. If no failures occur, the maximum size of the inconsistency window can be determined based on factors such as communication delays, the load on the system, and the number of replicas involved in the replication scheme. The most popular system that implements eventual consistency is DNS (Domain Name System). Updates to a name are distributed according to a configured pattern and in combination with time-controlled caches; eventually, all clients will see the update.
The eventual consistency model has a number of variations that are important to consider:
- Causal consistency. If process A has communicated to process B that it has updated a data item, a subsequent access by process B will return the updated value, and a write is guaranteed to supersede the earlier write. Access by process C that has no causal relationship to process A is subject to the normal eventual consistency rules.
- Read-your-writes consistency. This is an important model where process A, after it has updated a data item, always accesses the updated value and will never see an older value. This is a special case of the causal consistency model.
- Session consistency. This is a practical version of the previous model, where a process accesses the storage system in the context of a session. As long as the session exists, the system guarantees read-your-writes consistency. If the session terminates because of a certain failure scenario, a new session needs to be created and the guarantees do not overlap the sessions.
- Monotonic read consistency. If a process has seen a particular value for the object, any subsequent accesses will never return any previous values.
- Monotonic write consistency. In this case the system guarantees to serialize the writes by the same process. Systems that do not guarantee this level of consistency are notoriously hard to program.
A number of these properties can be combined. For example, one can get monotonic reads combined with session-level consistency. From a practical point of view these two properties (monotonic reads and read-your-writes) are most desirable in an eventual consistency system, but not always required. These two properties make it simpler for developers to build applications, while allowing the storage system to relax consistency and provide high availability.
As you can see from these variations, quite a few different scenarios are possible. It depends on the particular applications whether or not one can deal with the consequences.
Eventual consistency is not some esoteric property of extreme distributed systems. Many modern RDBMSs (relational database management systems) that provide primary-backup reliability implement their replication techniques in both synchronous and asynchronous modes. In synchronous mode the replica update is part of the transaction. In asynchronous mode the updates arrive at the backup in a delayed manner, often through log shipping. In the latter mode if the primary fails before the logs are shipped, reading from the promoted backup will produce old, inconsistent values. Also to support better scalable read performance, RDBMSs have started to provide the ability to read from the backup, which is a classical case of providing eventual consistency guarantees in which the inconsistency windows depend on the periodicity of the log shipping.
Server-side ConsistencyOn the server side we need to take a deeper look at how updates flow through the system to understand what drives the different modes that the developer who uses the system can experience. Let's establish a few definitions before getting started:
N = the number of nodes that store replicas of the data
W = the number of replicas that need to acknowledge the receipt of the update before the update completes
R = the number of replicas that are contacted when a data object is accessed through a read operation
If W+R > N, then the write set and the read set always overlap and one can guarantee strong consistency. In the primary-backup RDBMS scenario, which implements synchronous replication, N=2, W=2, and R=1. No matter from which replica the client reads, it will always get a consistent answer. In asynchronous replication with reading from the backup enabled, N=2, W=1, and R=1. In this case R+W=N, and consistency cannot be guaranteed.
The problems with these configurations, which are basic quorum protocols, is that when the system cannot write to W nodes because of failures, the write operation has to fail, marking the unavailability of the system. With N=3 and W=3 and only two nodes available, the system will have to fail the write.
In distributed-storage systems that need to provide high performance and high availability, the number of replicas is in general higher than two. Systems that focus solely on fault tolerance often use N=3 (with W=2 and R=2 configurations). Systems that need to serve very high read loads often replicate their data beyond what is required for fault tolerance; N can be tens or even hundreds of nodes, with R configured to 1 such that a single read will return a result. Systems that are concerned with consistency are set to W=N for updates, which may decrease the probability of the write succeeding. A common configuration for these systems that are concerned about fault tolerance but not consistency is to run with W=1 to get minimal durability of the update and then rely on a lazy (epidemic) technique to update the other replicas.
How to configure N, W, and R depends on what the common case is and which performance path needs to be optimized. In R=1 and N=W we optimize for the read case, and in W=1 and R=N we optimize for a very fast write. Of course in the latter case, durability is not guaranteed in the presence of failures, and if W < (N+1)/2, there is the possibility of conflicting writes when the write sets do not overlap.
Weak/eventual consistency arises when W+R <= N, meaning that there is a possibility that the read and write set will not overlap. If this is a deliberate configuration and not based on a failure case, then it hardly makes sense to set R to anything but 1. This happens in two very common cases: the first is the massive replication for read scaling mentioned earlier; the second is where data access is more complicated. In a simple key-value model it is easy to compare versions to determine the latest value written to the system, but in systems that return sets of objects it is more difficult to determine what the correct latest set should be. In most of these systems where the write set is smaller than the replica set, a mechanism is in place that applies the updates in a lazy manner to the remaining nodes in the replica's set. The period until all replicas have been updated is the inconsistency window discussed before. If W+R <= N, then the system is vulnerable to reading from nodes that have not yet received the updates.
Whether or not read-your-writes, session, and monotonic consistency can be achieved depends in general on the "stickiness" of clients to the server that executes the distributed protocol for them. If this is the same server every time, then it is relatively easy to guarantee read-your-writes and monotonic reads. This makes it slightly harder to manage load balancing and fault tolerance, but it is a simple solution. Using sessions, which are sticky, makes this explicit and provides an exposure level that clients can reason about.
Sometimes the client implements read-your-writes and monotonic reads. By adding versions on writes, the client discards reads of values with versions that precede the last-seen version.
Partitions happen when some nodes in the system cannot reach other nodes, but both sets are reachable by groups of clients. If you use a classical majority quorum approach, then the partition that has W nodes of the replica set can continue to take updates while the other partition becomes unavailable. The same is true for the read set. Given that these two sets overlap, by definition the minority set becomes unavailable. Partitions don't happen frequently, but they do occur between data centers, as well as inside data centers.
In some applications the unavailability of any of the partitions is unacceptable, and it is important that the clients that can reach that partition make progress. In that case both sides assign a new set of storage nodes to receive the data, and a merge operation is executed when the partition heals. For example, within Amazon the shopping cart uses such a write-always system; in the case of partition, a customer can continue to put items in the cart even if the original cart lives on the other partitions. The cart application assists the storage system with merging the carts once the partition has healed.
Amazon's DynamoA system that has brought all of these properties under explicit control of the application architecture is Amazon's Dynamo, a key-value storage system that is used internally in many services that make up the Amazon e-commerce platform, as well as Amazon's Web Services. One of the design goals of Dynamo is to allow the application service owner who creates an instance of the Dynamo storage system—which commonly spans multiple data centers—to make the trade-offs between consistency, durability, availability, and performance at a certain cost point.3
SummaryData inconsistency in large-scale reliable distributed systems has to be tolerated for two reasons: improving read and write performance under highly concurrent conditions; and handling partition cases where a majority model would render part of the system unavailable even though the nodes are up and running.
Whether or not inconsistencies are acceptable depends on the client application. In all cases the developer needs to be aware that consistency guarantees are provided by the storage systems and need to be taken into account when developing applications. There are a number of practical improvements to the eventual consistency model, such as session-level consistency and monotonic reads, which provide better tools for the developer. Many times the application is capable of handling the eventual consistency guarantees of the storage system without any problem. A specific popular case is a Web site in which we can have the notion of user-perceived consistency. In this scenario the inconsistency window needs to be smaller than the time expected for the customer to return for the next page load. This allows for updates to propagate through the system before the next read is expected.
The goal of this article is to raise awareness about the complexity of engineering systems that need to operate at a global scale and that require careful tuning to ensure that they can deliver the durability, availability, and performance that their applications require. One of the tools the system designer has is the length of the consistency window, during which the clients of the systems are possibly exposed to the realities of large-scale systems engineering.
References- Brewer, E. A. 2000. Towards robust distributed systems (abstract). In Proceedings of the 19th Annual ACM Symposium on Principles of Distributed Computing (July 16-19, Portland, Oregon): 7
- A Conversation with Bruce Lindsay. 2004. ACM Queue 2(8): 22-33.
- DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., Vogels, W. 2007. Dynamo: Amazon's highly available key-value store. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (Stevenson, Washington, October).
- Gilbert , S., Lynch, N. 2002. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant Web services. ACM SIGACT News 33(2).
- Lindsay, B. G., Selinger, P. G., et al. 1980. Notes on distributed databases. In Distributed Data Bases, ed. I. W. Draffan and F. Poole, 247-284. Cambridge: Cambridge University Press. Also available as IBM Research Report RJ2517, San Jose, California (July 1979).
7 TrackBacks
Listed below are links to blogs that reference this entry: Eventually Consistent - Revisited.
TrackBack URL for this entry: http://mt.vogels.net/mt-tb.cgi/132
Great article by Werner Vogels . Read More
In BASE - An ACID Alternative fanden sich ja schon Links auf den Artikel von Dan Pritchett und den von Werner Vogels. Einen Folgeartikel gibt es bei Eventually Consistent - Revisited. Den Artikel muß man lesen, indem man ein synchrones Replikationssystem Read More
The Irish Economy’s Rise Was Steep, and the Fall Was Fast - NYTimes.com (tags: financial europe crisis economy ireland) Would You Pay $103,000 for This Arizona Fixer-Upper? - Read More
It has been a couple of months since I wrote about cloud computing and Microsoft’s plans and strategies Read More
CIDR 2009 looks like it was an interesting conference, there were a lot of very interesting papers covering a whole range of data management and retrieval issues. The full list of papers can be browsed online, or downloaded as a zip file. There's plent... Read More
Great article by Werner Vogels . Read More
Excellent article by Werner Vogels on how to achieve eventual consistency. Read More

Reading Werner's articles on AWS is like attending an advanced applied Compsci course or apprenticing with a data center maniac. It's a great time to be alive as the services get better and crash into the economic nosedive with near perfect timing.
Data convergence horizons is one of the great remaining conundrums in the CS world. Early on we wrestled with it when it came to routing tables, then with namespace/DNS, and now it will need to be revisited in Cloud Computing. CC presents a far greater challenge as the amount of data that has to converge is tremendous and its state change continuous. It is amazing to think that certain data may "never" settle in the cloud and that we will have to start thinking about data not only in access but in Time To Live (TTL). Imagine parsing the cloud for certain bit of information gated by a TTL. Will our cloud apps be able to dismiss data based on volatility and a virtual toe tag?
...and we thought photoshop'ed images was a problem. :^)
Thank you for a thought-provoking, deeply useful post on a complex and essential topic. This issue of enterprise-level consistency of data (and of service levels, actually) is only going to get bigger and bigger. It could actually come to be a defining dimension for modern corporations and many kinds of social utility. For example, how will the New York Stock Exchange deliver consistent price and trade data to all investors? Is 'eventual' good enough?
Interestingly, I also see cases in the enterprise where eventual consistency probably was the right approach, but the business opted for a stricter, more expensive, more fragile goal of complete consistency, even though it was unnecessary. My suspicion is that this happens (opting for strict consistency when it's not needed) out of fear or ignorance: fear of taking responsibility for the business's genuine consistency profile, ignorance of what is and is not essential to the business (and the related fear of asking and being revealed as not knowing). These consistency discussions have the potential to be real "emperor's new clothes" moments!
How are Amazon handling indexing/searching of this data? Is it something similar to what CouchDB/Bigtables has done with Map/Reduce? The consistency model of indexed data must somewhat different from the storing model. Would love to get some story on how Amazon tackles this problem.
There are some interesting parallels here to memory consistency models and cache coherence for multiprocessor systems as well (though the issues are different as multiprocessor issues center around performance, scalability, and program semantics).
What has Amazon's experience has been like training developers to use Dynamo. When teaching students about multiprocessor memory consistency models, there is often confusion about some of the differences of what is guaranteed by which model, resulting in either poor performance or buggy programs. Have there been similar issues with Dynamo?
The first formal treatment of the consistency vs. latency (and if one wishes, availability and tolerance to network partitions) in fact appeared in the paper
Hagit Attiya, Jennifer L. Welch: Sequential Consistency versus Linearizability. ACM Trans. Comput. Syst. (TOCS) 12(2):91-122 (1994) - initially published in the SPAA 1991 conference.
It formally proves that for lineariability both read and write operations cannot be faster than the network latency. In sequential consistency, either reads or writes can be faster than the network latency, but at least one of them has to be slow.
In particular, this means that strong consistency conditions cannot be made highly available in partitionable environments, and cannot be implemented in a scalable manner.
I think the notion of eventual consistency is very interesting, but it is perhaps over simplified and over stated. Perhaps I am confused, but I don't see how the most common implementations of high availability fit into this model.
Missing from the eventual consistency discussion is the flexible strategy that most real-world system use. Real world systems don't confine themselves to a static choice of N, W, and R. Instead they dynamically switch these values based on component failures.
Let's look at mirrored disks as an example. Storage systems implement mirrored disks by writing both disks and reading from one. This is N=2, W=2, R=1 in your terminology.
If a disk fails, the mirroring system does not stubbornly stick with the W=2 model and fail all writes until the disk is repaired. Instead the system detects the failure of the disk, and transitions into a new state where reads and writes only use the remaining disk. So the system moves to N=1, W=1, R=1. At a much later time when the disk is replaced and the contents are resynchronized, the storage system transitions back to a state where writes go to both disks.
So there is some level of unavailability of writes, but this is confined to the time it takes to detect the failed disk and agree that no further writes will be issued to the failed disk.
Unavailability of reads is confined to some timeout where the system will reissue reads to the good disk.
Failing over reads is a much more efficient way to implement resiliency than reading from multiple copies since most systems are very read intensive and reading multiple copies (R>1) will cause a huge increase in workload and therefore cost.
You can argue that the mirrored disk subsystem really has N=2, W=1, R=1 since only write needs to succeed in the failed disk scenario, but then in your model this would imply weak/eventual consistency because W+R is not greater than N.
The real-world engineering issue then becomes how quickly the failure can be detected and communicated to all the clients or all the other write stores, and how quickly they can agree on voting the failed write store out of the configuration. In practice, if the failure detection and quorum reconfiguration can be kept relatively fast, then there is no need to give up any consistency at all.
The argument for eventual consistency then becomes that quorum reconfiguration takes too long. Telco's claim that the reconfiguration of their internal databases happen in single digit milliseconds. Storage arrays reconfigure failed disks in seconds or even less than a second. Database reconfiguration happens in seconds or small numbers of minutes. These numbers can all be driven down by engineering and configuration attention.
It seems like the use case of an eventually consistent architecture is really systems that cannot tolerate seconds of downtime and are willing to go through the trouble to build tolerance for inconsistency into the application to avoid seconds of downtime.
Taking a step back, I think the most difficult issue in deploying large and complex systems is often dealing with software bugs, not dealing with component failures. There are existing practical solutions for dealing with component failures that work well in practice and have much lower cost than the eventual consistency model.
It might be that the real source of availability in something like S3, Simple DB, and Dynamo is the fact that they define relatively simple services and these can be more thoroughly tested and debugged than much more complex services.
Another way to achieve "consistency" is to defer the consistency enforcement to a later batch-oriented reconciliation process, which will check if consistency can be achieved, and if not, fire up compensating actions to reestablish consistency.
The model is described here at ...
http://horicky.blogspot.com/2009/01/design-pattern-for-eventual-consistency.html
Rgds, Ricky
Any word on when Amazon CloudFront will support Gzip and the Last-Modified-Header?
Bandwidth on text files will GREATLY be reduced simply by have Gzip support and the Last-Modified-Header.
Werner, very interesting read. If you could add a few words on the relationship between eventual consistency and virtual synchrony, that would be really helpful. Specifically, VS seems to not ever reach consistency in the sense you describe, i.e. that all copies of the data exhibit the same state *eventually*. Instead, VS acknowledges the inherent asynchrony of systems and only requires that consistent state is visible at different clients (and different times) at the same point in the message stream.
Werner, the article on ACM-Queue covers this complex topic in a concise and understandable manner. Considering the complexity, this is not an easy task. Beside the variations of the consistency models, the basic message is that it is up to the development team to hide the trade-off between data consistency, system availability, and tolerance to network partitions. From my perspective, it needs a lot of effort and smart approaches to develop patterns & solutions to tackle this in future for a broader community of architects and developers. This might be especially true as we move more and more applications into the clouds. Greets, Maik
I was wondering if "eventual consistency" could be called "real time weak consistency", as most people in this business know what real time means.