Eventually Consistent
Recently there has been a lot of discussion about the concept of eventual consistency in the context of data replication. In this positing I would like to try to collect some of the principles and abstractions related to large scale data replication and the trade-offs between high-availability and data consistency. I consider this work-in-progress as I don’t expect to get every definition crisp the first time.
There are two ways of looking at consistency. One is from the developer / client point of view; how they observe data updates. The second way is from the server side; how updates flow through the system and what guarantees systems can give with respect to updates.
Historical
In an ideal world there would only be one consistency model; when an update is made all observers will see that update. The first time this surfaced as difficult to achieve was in the database systems in the late seventies. The best “period piece” on this topic is by Bruce Lindsay, et al, “Notes on Distributed Databases”, Research Report RJ2571(33471), IBM Research, July 1979. The fundamental principles for database replication are laid out in these notes and number of techniques are discussed deal with achieve consistency. Many of these techniques try to achieve distribution transparency; that to the user of the system it appears as if there is only one system instead of a number of collaborating systems. Many of these systems took the approach that it was better to fail the system than to break this transparency.
In the mid-nineties, with the rise of larger internet systems, these practices were revisited. At that time one started to give consideration to the idea that maybe availability was the more important property of these systems. As a result people were struggling what it should be traded-off against. Eric Brewer, Berkeley systems professor and at that time head of Inktomi, brought the different trade-offs together in a keynote to the PODC conference in 2000. Eric presented the CAP theorem, which states that of three properties of shared-data systems; data consistency, system availability and tolerance to network partition one can only achieve two at any given time. A more formal confirmation can be found in a paper by Gilbert and Lynch.
A system that is not tolerant to network partitions can achieve data consistency and availability, and often does so by using transaction protocols. To make this work, client and storage systems are part of the same environment and they fail as a whole under certain scenarios and as such clients cannot observe partitions. An important observation is that in larger distributed scale systems, network partitions are a given and as such consistency and availability cannot be achieved at the same time. This means that one has two choices on what to drop; relaxing consistency will allow the system to remain highly available under the partitionable conditions and prioritizing consistency means that under certain conditions the system will not be available.
Both require the client developer to be aware of what the system is offering. If the system emphasizes consistency, the developer has to deal with the fact that system may not be available to take for example a write. If this write fails because of system unavailability the developer will have to deal with what to do with the data to be written. If the system emphasizes availability, it may always accept the write but under certain conditions a read will not reflect the result of a recently completed write. The developer then has to make a decision about whether the client requires access to the absolute latest update all the time. There is a range of applications that can handle slightly stale data and they are served well under this model.
In principle the consistency property of transaction systems as defined in the ACID properties is a different kind of consistency guarantee. In ACID consistency relates to the guarantee that when a transaction is finished the database is in a consistent state; for example when transferring money one from account to another the total amount held in both accounts should not change. In ACID based systems this kind of consistency is often the responsibility of the developer writing the transaction, but can be assisted by the database managing integrity constraints.
Client side consistency
At the client side there are four components:
- A storage system. For the moment we’ll treat it as a black box, but if you want you should assume that under the covers it is something big and distributed and built to guarantee durability and availability.
- Process A. A process that writes to and reads from the storage system.
- Process B & C. Two processes independent of process A that also write to and read from the storage system. It is irrelevant whether these are really processes or threads within the same process, important is that they are independent and need to communicate the share information.
At the client side consistency has to do with how and when an observer (in this case processes A, B or C) sees updates made to a data object in the storage systems. In the following examples Process A has made an update to a data object.
- Strong consistency. After the update completes any subsequent access (by A, B or C) will return the updated value.
- Weak consistency. The system does not guarantee that subsequent accesses will return the updated value. A number of conditions need to be met before the value will be returned. Often this condition is the passing of time. The period between the update and the moment when it is guaranteed that any observer will always see the updated value is dubbed the inconsistency window.
- Eventual consistency. The storage system guarantees that if no new updates are made to the object eventually (after the inconsistency window closes) all accesses will return the last updated value. The most popular system that implements eventual consistency is DNS, the domain name system. Updates to a name are distributed according to a configured pattern and in combination with time controlled caches, eventually of client will see the update.
There are a number of variations on the eventual consistency model that are important to consider:
- Causal consistency. If process A has communicated to process B that it has updated a data item, a subsequent access by process B will return the updated value and a write is guaranteed to supersede the earlier write. Access by process C that has no causal relationship to process A is subject to the normal eventual consistency rules.
- Read-your-writes consistency. This is an important model where process A after it has updated a data item always accesses the updated value and never will see an older value. This is a special case of the causal consistency model.
- Session consistency. This is a practical version of the previous model, where a process accesses the storage system in the context of a session. As long as the session exists, the system guarantees read-your-writes consistency. If the session terminates because of certain failure scenarios a new session needs to be created, and the guarantees do not overlap the sessions.
- Monotonic read consistency. If a process has seen a particular value for the object any subsequent accesses will never return any previous values.
- Monotonic write consistency. In this case the system guarantees to serialize the writes by the same process. Systems that do not guarantee this level of consistency are notoriously hard to program.
A number of these properties can be combined. For example one can get monotonic reads combined with session level consistency. From a practical point of view these two properties (monotonic reads and read-your-writes) are most desirable in an eventually consistent system, but not always required.
As you can see from the different variations there are quite a few different scenarios possible. It depends on the particular applications whether or not one can deal with the consequences.
Eventually consistency is not some esoteric property of extreme distributed systems. Many modern RDBMS systems that provide primary-backup reliability implement their replication techniques in both synchronous and asynchronous modes. In synchronous mode the replica update is part of the transaction, in asynchronous mode the updates arrive at the backup in a delayed manner, often through log shipping. In the last mode if the primary fails before the logs are shipped, reading from the promoted backup will produce old, inconsistent values. Also to support better scalable read performance RDBMS systems have start to provide reading from the backup, which is a classical case of providing eventual consistency guarantees, where the inconsistency windows depends on the periodicity of the log shipping.
Server-side consistency.
We need to establish a few definitions before we can get started:
- N – the number of nodes that store a replicas of the data
- W – the number of replicas that need to acknowledge the receipt of the update before the update completes
- R – the number of replicas that are contacted when a data object is accessed through a read operation
If W+R > N than the write set and the read set always overlap and one can guarantee strong consistency. In the primary-backup RDBMS scenario which implements synchronous replication N=2, W=2 and R=1. No matter from which replica the client reads, it will always get a consistent answer. In the asynchronous replication case with reading from the backup enabled N=2, W=1 and R=1. In this case R+W=N and consistency cannot be guaranteed.
The problems with these configurations, which are basic quorum protocols, is that when because of failures the system cannot write to W nodes, the write operation has to fail, marking the unavailability of the system. With N=3 and W=3 and only 2 nodes available the system will have to fail the write.
In distributed storage systems that need to address high-performance and high-availability the number of replicas is in general higher than 2. Systems that focus solely on fault-tolerance often use N=3 (with W=2 and R=2 configurations). Systems that need to serve very high read loads often replicate their data beyond what is required for fault-tolerance, where N can be tens or even hundreds of nodes and with R configured to 1 such that a single read will return a result. For systems that are concerned about consistency they set W=N for updates, but which may decrease the probability of the write succeeding. A common configuration for systems in this configuration that are concerned about fault-tolerance, but not consistency, is to run with W=3 to get basic durability of the update and then rely on a lazy (epidemic) technique to update the other replicas.
How to configure N, W and R depends on what the common case is and which performance path needs to be optimized. In R=1 and N=W we optimize for the read case and in the W=1 and R=N we would optimize for a very fast write. Of course in the latter case, durability is not guaranteed in the presence of failures, and if W < (N+1)/2 there is the possibility of conflicting writes because write sets do not overlap.
Weak/eventually consistency arises when W+R <= N, meaning that there is no overlap in the read and write set. If this configuration is deliberate and not based on a failure case, than it hardly makes sense to set R to anything else but 1. There are two very common cases where this happens: the first is the massive replication for read scaling mentioned earlier and the second is where data access is more complicated. In a simple key-value model it is easy to compare versions to determine which is the latest value but in systems that return sets of objects is it more difficult to determine what the correct latest set should be. In these systems where the write set is smaller than the replica set, there is a mechanism in place that in a lazy manner applies the updates to the remaining nodes in the replicas set. The period until all replicas have been updated is the inconsistency window discussed before. If W+R <= N than the system is vulnerable to reading from nodes that have not yet received the updates.
Whether or not read-your-write, session and monotonic consistency can be achieved depends in general on the “stickiness” of clients to the server that executes the distributed protocol for them. If this is the same server every time than it is relatively easy to guarantee read-your-writes and monotonic reads. This makes it slightly harder to manage load balancing and fault-tolerance, but it is a simple solution. Using sessions, which are sticky, makes this explicit and provides an exposure level that clients can reason about.
Sometimes read-your-writes and monotic reads are implemented by the client. By adding versions on writes, the client discards reads of values with versions that precede the last seen version.
Partitions happen when some nodes in the system cannot reach other nodes, but all can be reached by clients. If you use a classical majority quorum approach, than the partition that has W nodes of the replica set can continue to take updates while the other partition becomes unavailable. The same for the read set. Given that these two sets overlap, by definition the minority set becomes unavailable. Partitions don’t happen that frequently, but they do occur, between datacenters as well inside datacenters.
Summary
Inconsistency can be tolerated for two reasons: for improving read and write performance under highly concurrent conditions and for handling partition cases where a majority model would render part of the system unavailable even though the nodes are up and running.
Whether or not inconsistencies are acceptable depends on the client application. A specific popular case is a website scenario in which we can have the notion of user-perceived consistency; the inconsistency window needs to be smaller than the time expected for the customer to return for the next page load. This allows for updates to propagate through the system, before the next read is expected.
In this posting I have tried to collect some of the definitions and principles around consistency and availability models. I expect this list to be incomplete, maybe even incorrect or lacking in subtlety. I will continue to update this until it is more useful and complete.
6 TrackBacks
Listed below are links to blogs that reference this entry: Eventually Consistent.
TrackBack URL for this entry: http://mt.vogels.net/mt-tb.cgi/109
These are the web's most talked about URLs on Fri 21st Dec 2007. The current winner is .. Read More
I just read the most excellent entry from Werner Vogels on Eventual Consistency , and the different ways one can consider it in distributed data systems. Recommended reading! I hope to be able to include explicit support for the different things he descr Read More
Quickies IV Read More
Here are some recent writings I had hoped to blog about but failed to: Hunter R. Rawlings III, "Information, Knowledge, Authority, and Democracy" Michael Francis Booth, "Social Networks: Stop Designing Out The Fun" Jaron Lanier, "Long Live Closed-Sourc... Read More
Amazon´s SimpleDB is an exciting new player in the database world. It´s free, it´s online, it´s not relational Read More
There are a whole bunch of interesting posts / stuff I find on the net that I bookmark on del.icio.us Read More

Hi Werner, what a excellent article. I'm trying to dive into this topic (eventual consistency) a little bit deeper. Paxos and related areas of research helped me in getting a basic understanding in the beginning. Your post makes a lot of question marks disappear (in my thinking). Anyway, the Acid link leads to the chemical substance. Try: http://en.wikipedia.org/wiki/ACID
In the fourth paragraph,"participation" should be "partition".
Nice article. It is getting forwarded to a bunch of people.
Maik, Walter - I appreciate the corrections. Thanks!
Your paper on Dynamo was an eye-opener for me. This was a great read. Thank you.
Some corrections:
In this case the systems guarantees -> systems guarantee
We need to establish a few definitions befor we can get started -> before
the number of replicas that is contacted -> are contacted
Inconsistency can tolerated -> can be tolerated
handling partitions cases -> partition
Wow, embarassement. Thanks for all your help (also in email) with my second language. I rushed this out before some travels tomorrow, but I should have taken some more proofreading time.
Thanks!
Interesting article.
Quick typo correction:
eventually of client will see the update -> all clients
I like to read things on eventual consistency and i found your article
very interesting...
However, i disagree on your definition of eventual consistency.
In http://www.faqs.org/rfcs/rfc677.html, Jonhson and Thomas defined
EC as follows:
Consistency
The extent to which the copies of the database can be kept
"identical" must be examined. Because of the inherent delay in
communications between DBMPs, it is impossible to guarantee that the
data bases are identical at all times. Rather, our goal is to guarantee
that the copies are "consistent" with each other. By this we mean that
given a cessation of update activity to any entry, and enough time for
each DBMP to communicate with all other DBMPs, then the state of that
entry (its existence and value) will be identical in all copies of the
database.
And i think that it is a little different from : "The storage system guarantees that if no new updates are made to the object eventually (after the inconsistency window closes) all accesses will return the last updated value".
You definition is maybe correct with only update operations. If you consider insert or delete operations, your definition need to be completed...
IMHO, a system is EC if all copies are identical when the system is
idle. With this definition, Causal Consistency is not a subset of EC.
It is possible to have a causal schedules that leads to divergent states...
Maybe the best example for EC is not DNS, but Usenet...
Thank you for the in-depth discussion and research references. I wish to one day "rush out" stuff this great. :-)
@"As you can see from the different variations there are quite a few different scenarios possible. It depends on the particular applications whether or not one can deal with the consequences."
Can 'deal with'... or 'can accept'? Dealing with the problem makes it sound like it comes down to cost-effectiveness rather than business necessity.
Here are some typographical corrections:
for example when transferring money one from account to another the total amount held in both accounts should not change. -> from one account to another
If W+R > N than the write set and the read set always overlap and one can guarantee strong consistency. -> then
If this configuration is deliberate and not based on a failure case, than it hardly makes sense to set R to anything else but 1. -> then
If W+R then
If this is the same server every time than it is relatively easy to guarantee read-your-writes and monotonic reads. -> then
If you use a classical majority quorum approach, than the partition that has W nodes of the replica set can continue to take updates while the other partition becomes unavailable. -> then
This is very interesting, and I want to hear more. For example, if you have multiple clients updating the same data with different values, how do you know which value is 'correct' value? Do you use some kind of counter? Or do you timestamp the data?
Hello.
Thank you for the excellent summary.
I made a Japanese translation of this article,
which is available at
http://www.hyuki.com/yukiwiki/wiki.cgi?EventuallyConsistent
If you have any trouble, please let me know.
Hi, Werner. Interesting post! Can I ask a few questions?
Eventual consistency is a special case of weak consistency, correct? You separated strong, weak, and eventual consistency into three categories, but the third is a special case of the second, is it not?
Also, the definition of eventual consistency seems to suggest that any second update in the inconsistency window of an update will reset the inconsistency window of the first update to be the same as the second update, so I could see either the original value, first update value, or second update value even after the first update inconsistency window has expired if the second update's inconsistency window has not yet expired. Is that correct? Or are clients guaranteed to see the value of either the first or the second update once the inconsistency window of the first update expires (that is, all updates have a fixed inconsistency time)?
On the inconsistency windows, are those usually fixed or given with probabilities? That is, do we guarantee consistency after 10 seconds or that we will have consistency 99.9% of the time after 10 seconds? The former seems much harder to achieve than the latter.
It seems like session consistency would be sufficient for most web applications. It seems that it also would allow setting N high, W low (or even W=1 for some apps that can tolerate data loss), and R=1, improving performance. Do you agree? I am guessing you do not since you do not spend much time talking about session consistency. Could you elaborate on session consistency and, perhaps, discuss specific scenarios where it is insufficient? Clearly, there is the case where a session is lost but, even when that happens, it seems like you may be able to make clients wait past a short (less than 2 second) inconsistency window when they lose a session and start a new one, avoiding reading stale data. Is that not correct?
If there are no other conditions in weak consistency besides the passing of time then are the concepts "weak consistency" and "eventual consistency" equivalent? Am I right in thinking that they are?
Also, what are examples of conditions in weak consistency that are not time related that need to be met before the reader reads the updated value?
OOTO - on vacation - replies in the new year.
Related (intro) blog entry:
http://www.michaelnygard.com/blog/2007/11/architecting_for_latency.html
which comments on:
http://www.infoq.com/articles/pritchett-latency
I like to print these and read them on the train, but neither FF nor IE seems to be able to handle this. A printable view would be nice...
Hi,
A nice article, thanks Vogels!
I tend to agree with the definition of eventual consistency given by Vogels. A popular eventual consistency system - Bayou (http://www2.parc.com/csl/projects/bayou/pubs/dataeng-98/DataEngineeringDec98.frame.pdf)
defines EC in a similar way - if application stop issuing writes, then all replicas will reach a consistent states - note that it requires replicas to apply the writes in the same order. Bayou uses timestamps to order the writes - this can be a bit tricky due to the absence of a global clock in distributed systems.
Another interesting paper on the Swarm system also exploits weak consistency across a WAN. It carries an interesting discussion - that there are different dimensions to consistency - Vogels discussed only the timeliness guarantees (and also possibly on update ordering dimension - writes have to be ordered the same way across replicas). The other dimension include - update ordering, failure handling (relative to other replicas), visibiliy and isolation (session semantics or not), concurrency (parallelism w.r.t reads and writes).