My paper to read this weekend was the Alan Demers' seminal paper on epidemic techniques for database replication. I realized that in 2004, before my Amazon days, I already wrote a blog post about the fundamental publications in the area of epidemics, so this seems like a good moment to revisit that with updated links, etc.
History of Epidemics
In the past 6-8 years we have been using various epidemic techniques in building our reliable and scalable distributed systems with great success. Now that industry is starting to deal with issues of scale that can almost only be solved by using epidemic techniques, it becomes important to produce some basic pointers to the origin of the use of epidemics in distributed systems.
In a nutshell: An epidemic style of communication or state sharing gives you a highly robust medium for distributed interaction. The main advantages are
- Probabilistic model. This doesn't mean that it gives less guarantees than a deterministic model, but that we now have a good framework for reasoning about the spread of information through a system over time.
- A-synchronous communication pattern. Any good epidemic communication system allows you to operate in a 'fire-and -forget' mode, where, even if the initial sender fails, all surviving nodes will receive the update (or none will).
- Autonomous & decentralized actions. Epidemics techniques enable you to take actions based on the data you received without the need for additional communication to reach agreement with your partners; you can take decisions autonomously.
- Robust with respect to message loss & node failures. Once a message has been received by at least one of your peers it is almost impossible to prevent the spread of the information through the system. In the most popular demo I give, the system still operates under 90% message loss with limited or no loss in functionality.
- Rigorous mathematical underpinnings. Finally we have a set protocols where we can use rigorous mathematical techniques to reason about the operation of the protocols under all sorts of conditions
These techniques have a long history in science, but mainly in biology and epidemiology, and in mathematics. The bible of epidemics from a theorectic point of view is:
Epidemic Theory of Infectious Diseases and its Applications N.T.J. Bailey Hafner Press Second Edition, 1957
This is not a computer science text, but it explains the real fundamentals. If you interested in more CS oriented texts than the following list has some general papers that deal with the basics of epidemic communication or 'gossip' and are a good starting point to learn about fundamentals.
I purposely left the Cornell papers off this list, as not to appear too self-serving. I believe the work at Cornell has been ground-breaking in bringing the techniques to a larger CS audience and applying it to building robust and scalable distributed systems. I'll give a reading list of Cornell work in a follow-up posting.
- B. Baker, R. Shostak, Gossips and telephones, Discrete Mathematics 2(1972), pp. 191--193.
- A. Demers, D. Greene, A. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. Epidemic algorithms for replicated database maintenance. In Proc. ACM Symp. on the Principles of Distr. Computing, pages 1--12, August 1987.
- R. Golding and K. Taylor. Group membership in the epidemic style. Technical Report UCSC-CRL-92-13, University of California, Santa Cruz, May 1992.
- D. Agrawal, A. El-Abbadi, and R. Steinke. Epidemic algorithms in replicated databases. In Proc. 16th ACM SIGACT-SIGMOD Symp. Princip. of Database Systems (PODS), Tucson, Arizona, May 1997
- R. Karp, C. Schindelhauer, S. Shenker, B. Vocking, Randomized rumor spreading, Proc. IEEE Symp. Foundations of Computer Science, 2000.