All Things Distributed
I wrote a first version of this posting on consistency models in December 2007, but I was never happy with it as it was written in haste and the topic is important enough to receive a more thorough treatment. ACM Queue asked me to revise it for use in their magazine and I took the opportunity to improve the article.
I posted an update to this article in December 2008 under the title Eventually Consistent - Revisited - please read that article instead of this one. I am leaving this one here for transparency and historical reasons, and because the comments helped me improve the article, for which I am grateful.
Recently there has been a lot of discussion about the concept of eventual consistency in the context of data replication. In this posting I would like to collect some of the principles and abstractions related to large-scale data replication and the trade-offs between high availability and data consistency. I consider this a work in progress, as I don't expect to get every definition crisp the first time.
There are two ways of looking at consistency. One is from the developer / client point of view; how they observe data updates. The second way is from the server side; how updates flow through the system and what guarantees systems can give with respect to updates.
In an ideal world there would be only one consistency model: when an update is made, all observers see that update. The first time this surfaced as difficult to achieve was in the database systems of the late seventies. The best "period piece" on this topic is by Bruce Lindsay, et al, "Notes on Distributed Databases", Research Report RJ2571(33471), IBM Research, July 1979. The fundamental principles for database replication are laid out in these notes, and a number of techniques that deal with achieving consistency are discussed. Many of these techniques try to achieve distribution transparency; that is, to the user of the system it appears as if there is only one system instead of a number of collaborating systems. Many of these systems took the approach that it was better to fail the system than to break this transparency.
In the mid-nineties, with the rise of larger internet systems, these practices were revisited. At that time people started to consider the idea that maybe availability was the more important property of these systems, and as a result they were struggling with what it should be traded off against. Eric Brewer, systems professor at Berkeley and at that time head of Inktomi, brought the different trade-offs together in a keynote to the PODC conference in 2000. Eric presented the CAP theorem, which states that of three properties of shared-data systems (data consistency, system availability, and tolerance to network partitions) one can only achieve two at any given time. A more formal confirmation can be found in a paper by Gilbert and Lynch.
A system that is not tolerant to network partitions can achieve data consistency and availability, and often does so by using transaction protocols. To make this work, client and storage systems are part of the same environment; under certain scenarios they fail as a whole, and as such clients cannot observe partitions. An important observation is that in large distributed systems, network partitions are a given, and as such consistency and availability cannot be achieved at the same time. This means that one has two choices on what to drop: relaxing consistency allows the system to remain highly available under partitioned conditions, while prioritizing consistency means that under certain conditions the system will not be available.
Both options require the client developer to be aware of what the system is offering. If the system emphasizes consistency, the developer has to deal with the fact that the system may not be available to take, for example, a write. If this write fails because of system unavailability, the developer has to decide what to do with the data to be written. If the system emphasizes availability, it may always accept the write, but under certain conditions a read will not reflect the result of a recently completed write. The developer then has to decide whether the client requires access to the absolute latest update at all times. There is a range of applications that can handle slightly stale data, and they are served well under this model.
In principle the consistency property of transaction systems, as defined in the ACID properties, is a different kind of consistency guarantee. In ACID, consistency relates to the guarantee that when a transaction is finished the database is in a consistent state; for example, when transferring money from one account to another, the total amount held in both accounts should not change. In ACID-based systems this kind of consistency is often the responsibility of the developer writing the transaction, but it can be assisted by the database managing integrity constraints.
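To make the ACID notion of consistency concrete, here is a minimal sketch of the money-transfer example as an application-level invariant. The account names and the conservation check are my illustration, not code from any particular database system; a real system would run this inside a transaction with integrity constraints.

```python
class InsufficientFunds(Exception):
    pass

def transfer(accounts, src, dst, amount):
    """Move `amount` from src to dst while preserving the invariant
    that the total balance across all accounts does not change."""
    total_before = sum(accounts.values())
    if accounts[src] < amount:
        raise InsufficientFunds(src)
    accounts[src] -= amount
    accounts[dst] += amount
    # The developer-enforced consistency check the ACID "C" refers to:
    assert sum(accounts.values()) == total_before

accounts = {"a": 100, "b": 50}
transfer(accounts, "a", "b", 30)
```

The invariant here is checked by the application itself; a database could assist with a CHECK constraint or trigger, which is the assistance the paragraph above refers to.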
Client side consistency
At the client side there are four components:

- A storage system. For the moment we treat it as a black box, but one should assume that under the covers it is something of large scale and highly distributed.
- Process A, a process that writes to and reads from the storage system.
- Processes B and C, two processes independent of process A that also write to and read from the storage system.
At the client side consistency has to do with how and when an observer (in this case processes A, B or C) sees updates made to a data object in the storage systems. In the following examples Process A has made an update to a data object.
There are a number of variations on the eventual consistency model that are important to consider:

- Read-your-writes consistency: after a process has updated a data item, it will always see the updated value and never an older one.
- Session consistency: a practical version of read-your-writes, where the guarantee holds within the scope of a session; if the session terminates, the guarantee does not carry over to a new session.
- Monotonic read consistency: if a process has seen a particular value for an object, any subsequent reads will never return an older value.
A number of these properties can be combined. For example one can get monotonic reads combined with session level consistency. From a practical point of view these two properties (monotonic reads and read-your-writes) are most desirable in an eventually consistent system, but not always required.
As you can see from the different variations, quite a few different scenarios are possible. Whether or not one can deal with the consequences depends on the particular application.
Eventual consistency is not some esoteric property of extreme distributed systems. Many modern RDBMSs that provide primary-backup reliability implement their replication techniques in both synchronous and asynchronous modes. In synchronous mode the replica update is part of the transaction; in asynchronous mode the updates arrive at the backup in a delayed manner, often through log shipping. In the latter mode, if the primary fails before the logs are shipped, reading from the promoted backup will produce old, inconsistent values. Also, to support better read scalability, RDBMSs have started to provide the ability to read from the backup, which is a classical case of providing eventual consistency guarantees, where the inconsistency window depends on the periodicity of the log shipping.
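The asynchronous log-shipping case above can be sketched with a toy simulation. This is my illustration, not the replication code of any actual RDBMS; the `Primary`, `Backup`, and `ship_log` names are invented for the example. It shows how a read from the backup is stale until the log batch ships.

```python
class Primary:
    def __init__(self):
        self.data = {}
        self.log = []              # updates not yet shipped to the backup

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def ship_log(self, backup):
        """Apply the pending log to the backup, then clear it."""
        for key, value in self.log:
            backup.data[key] = value
        self.log.clear()

class Backup:
    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)

primary, backup = Primary(), Backup()
primary.write("x", 1)
stale = backup.read("x")           # log not shipped yet: stale (missing) value
primary.ship_log(backup)
fresh = backup.read("x")           # backup has caught up
```

The window between `write` and `ship_log` is exactly the inconsistency window the text describes; its length in a real system depends on how often logs are shipped.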
We need to establish a few definitions before we can get started:

- N: the number of nodes that store a replica of the data.
- W: the number of replicas that need to acknowledge an update before the write completes.
- R: the number of replicas that are contacted for a read operation.
If W+R > N, then the write set and the read set always overlap and one can guarantee strong consistency. In the primary-backup RDBMS scenario with synchronous replication, N=2, W=2 and R=1: no matter which replica the client reads from, it will always get a consistent answer. In the asynchronous replication case with reading from the backup enabled, N=2, W=1 and R=1. In this case R+W = N and consistency cannot be guaranteed.
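The quorum condition is simple enough to express directly. The sketch below just encodes the inequality from the text, with the two RDBMS configurations as examples:

```python
def guarantees_strong_consistency(n, w, r):
    """True when every read quorum must intersect every write quorum,
    i.e. W + R > N, so a read always sees the latest committed write."""
    return w + r > n

# Synchronous primary-backup: N=2, W=2, R=1 -> read and write sets overlap.
sync_ok = guarantees_strong_consistency(2, 2, 1)

# Asynchronous replication, reading from the backup: N=2, W=1, R=1 -> no guarantee.
async_ok = guarantees_strong_consistency(2, 1, 1)
```
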
The problem with these configurations, which are basic quorum protocols, is that when the system cannot write to W nodes because of failures, the write operation has to fail, marking the unavailability of the system. With N=3, W=3 and only two nodes available, the system has to fail the write.
In distributed storage systems that need to provide high performance and high availability, the number of replicas is in general higher than two. Systems that focus solely on fault tolerance often use N=3 (with W=2 and R=2 configurations). Systems that need to serve very high read loads often replicate their data beyond what is required for fault tolerance; N can be tens or even hundreds of nodes, with R configured to 1 such that a single read returns a result. Systems that are concerned about consistency set W=N for updates, which may decrease the probability of the write succeeding. A common configuration for systems that are concerned about fault tolerance but not consistency is to run with W=3, to get basic durability of the update, and then rely on a lazy (epidemic) technique to update the other replicas.
How to configure N, W and R depends on what the common case is and which performance path needs to be optimized. With R=1 and N=W we optimize for the read case, and with W=1 and R=N we optimize for a very fast write. Of course in the latter case, durability is not guaranteed in the presence of failures, and if W < (N+1)/2 there is the possibility of conflicting writes because the write sets do not overlap.
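The conflicting-writes condition can also be written down as a one-line check. This is only an encoding of the inequality above: when W is less than a majority of N, two write quorums can be disjoint, so two concurrent writes may both "succeed" without either seeing the other.

```python
def conflicting_writes_possible(n, w):
    """Two write quorums of size w among n nodes can be disjoint
    when w < (n + 1) / 2, so concurrent writes may conflict."""
    return w < (n + 1) / 2

# N=3, W=1: two writers can each hit a different single node -> conflict possible.
# N=3, W=2: any two write quorums share at least one node -> no silent conflict.
```
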
Weak/eventual consistency arises when W+R <= N, meaning that the read and write sets are not guaranteed to overlap. If this configuration is deliberate and not based on a failure case, then it hardly makes sense to set R to anything but 1. There are two very common cases where this happens: the first is the massive replication for read scaling mentioned earlier, and the second is where data access is more complicated. In a simple key-value model it is easy to compare versions to determine which is the latest value, but in systems that return sets of objects it is more difficult to determine what the correct latest set should be. In these systems, where the write set is smaller than the replica set, a mechanism is in place that lazily applies the updates to the remaining nodes in the replica set. The period until all replicas have been updated is the inconsistency window discussed before. If W+R <= N, then the system is vulnerable to reading from nodes that have not yet received the updates.
Whether or not read-your-writes, session and monotonic consistency can be achieved depends in general on the "stickiness" of clients to the server that executes the distributed protocol for them. If this is the same server every time, then it is relatively easy to guarantee read-your-writes and monotonic reads. This makes load balancing and fault tolerance slightly harder to manage, but it is a simple solution. Using sessions, which are sticky, makes this explicit and provides an exposure level that clients can reason about.
Sometimes read-your-writes and monotonic reads are implemented by the client. By adding versions to writes, the client discards reads of values whose versions precede the last version it has seen.
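A minimal sketch of this client-side technique might look like the following. The `Store` and `VersionedClient` classes and their method names are assumptions for the example, not an existing API; the point is only that the client tracks the highest version it has observed per key and refuses to "go back in time".

```python
class Store:
    """A toy versioned store: every write bumps a version counter."""
    def __init__(self):
        self.version = 0
        self.data = {}

    def write(self, key, value):
        self.version += 1
        self.data[key] = (self.version, value)
        return self.version

    def read(self, key):
        return self.data[key]          # returns (version, value)

class VersionedClient:
    """Enforces read-your-writes and monotonic reads on the client side."""
    def __init__(self, store):
        self.store = store
        self.last_seen = {}            # highest version observed, per key

    def write(self, key, value):
        self.last_seen[key] = self.store.write(key, value)

    def read(self, key):
        version, value = self.store.read(key)
        if version < self.last_seen.get(key, 0):
            return None                # stale replica answer: discard it
        self.last_seen[key] = version
        return value

store = Store()
client = VersionedClient(store)
client.write("x", "v1")                # client now remembers version 1
store.data["x"] = (0, "old")           # simulate an answer from a stale replica
stale = client.read("x")               # discarded: version 0 precedes version 1
store.data["x"] = (1, "v1")            # replica catches up
fresh = client.read("x")
```

In practice the client would retry another replica instead of returning None, but the version comparison is the essence of the technique described above.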
Partitions happen when some nodes in the system cannot reach other nodes, but all of them can be reached by clients. If you use a classical majority quorum approach, then the partition that has W nodes of the replica set can continue to take updates while the other partition becomes unavailable. The same holds for the read set. Given that these two sets overlap, by definition the minority set becomes unavailable. Partitions don't happen that frequently, but they do occur, between datacenters as well as inside datacenters.
Inconsistency can be tolerated for two reasons: for improving read and write performance under highly concurrent conditions and for handling partition cases where a majority model would render part of the system unavailable even though the nodes are up and running.
Whether or not inconsistencies are acceptable depends on the client application. A specific popular case is the website scenario, in which we can use the notion of user-perceived consistency: the inconsistency window needs to be smaller than the time expected for the customer to return for the next page load. This allows updates to propagate through the system before the next read is expected.
In this posting I have tried to collect some of the definitions and principles around consistency and availability models. I expect this list to be incomplete, maybe even incorrect or lacking in subtlety. I will continue to update this until it is more useful and complete.