These are the old pages from the weblog as they were published at Cornell. Visit www.allthingsdistributed.com for up-to-date entries.

May 06, 2004

Epidemic Computing at Cornell

In a posting last week I gave a bit of history of the use of Epidemic techniques in computing and why we are so interested in them when it comes to distributed computing. It is remarkable that many of the techniques that make systems scalable in general, are at the a part of epidemics. Of course biological epidemics scale really well and are very robust: once a few subjects are infected it is almost impossible to stop the spread, even if you isolate the original source.

One of the other cool properties of epidemics that I didn't mention in the original posting is they work better and better at large scale. This means that having to scale up your system is no longer a burden, but it becomes an advantage and you can deliver on the true promise of scale-out: more nodes means a more robust system.

Various groups at Cornell have been looking at problems using similar techniques. For example Jon Kleinberg and David Kempe's work on gossip protocols & small world phenomena, spread of influence in social networks, or David's work with Alin Dobra and Johannes Gehrke on computing aggregate information using gossip.  And of course since the return of Alan Demers we have on of the founding father of epidemics techniques in house.

In my group we have been using epidemic techniques to build reliable distributed systems that are robust to all sort of perturbations and as such are better capable of handling the issues related to scalability. The following is a short list of papers on some of the key topics. There are many more related papers, on detailed techniques or on applications to areas such as peer-to-peer, see the spinglas publications page for a good overview.

  • Katherine Guo, et al. GSGC: an efficient gossip-style garbage collection scheme for scalable reliable multicast.
    Katie was one of the first to apply gossip techniques to address the issue of guaranteeing message delivery to all participants in a conversation even if the original sender fails.
  • Robbert van Renesse, Yaron Minsky, and Mark Hayden, A gossip-style failure detection service.
    I consider this to be one of the most important ideas to come out of the use of epidemics in distributed computing: how to use gossip to derive information about whether a participant in a conversation has failed or not. Shows all the asynchrony and autonomy powers of epidemics.
  • Robbert. van Renesse. Scalable and Secure Resource Location.
    This is  first paper on the epidemic state sharing system, which later became Astrolabe
  • Ken Birman, et al. Bimodal multicast (TOCS 1999)
     The power combination of unreliable multicast combined with an epidemic style reliability mechanism results in one of the most robust reliable multicast protocols ever built.
  • Öznur Özkasap, et al.. Efficient buffering in reliable multicast protocols.
    This shows another side of the epidemics: how you can use the rigorous mathematical underpinnings to reason about knowledge in a distributed system and how you can relax in this case global buffering requirements to still achieve full reliability.
  • Robbert van Renesse, Ken Birman and Werner Vogels, Astrolabe: A Robust and Scalable Technology for Distributed Monitoring, Management and Data Mining, (TOCS 2003)
    This is paper that pulls together many of the epidemic techniques in an extremely powerful system for monitoring and control massively distributed systems.

  • Posted by Werner Vogels at May 6, 2004 10:59 AM
    TrackBacks

    Interesting Readings
    Excerpt:
    Weblog: Tiago Pascoal's WebLog
    Tracked: May 15, 2004 06:24 AM
    Interesting Readings on scalability
    Excerpt:
    Weblog: Tiago Pascoal's WebLog
    Tracked: May 15, 2004 07:39 AM
    Comments

    I must say that you guys really do rock.

    One of the other cool properties of epidemics that I didn't mention in the original posting is they work better and better at large scale. This means that having to scale up your system is no longer a burden, but it becomes an advantage and you can deliver on the true promise of scale-out: more nodes means a more robust system.

    <heh> Can't wait until you unleash this beast.

    Many thanks for the paper refs.

    Posted by: Hal on May 7, 2004 12:53 AM