These are the old pages from the weblog as they were published at Cornell. Visit for up-to-date entries.

October 10, 2003

CPA & Failure Detection

I have dug a bit deeper into the motivation for the renewed interest in fault-tolerant computing, this time under the name of Continuous Processing Architectures. I haven't seen any truly new arguments compared with, let's say, 5 years ago.

My general position is that the need for highly available architectures has crept up on many enterprises, to the point where any service degradation through the failure of compute services, or even a modest performance degradation, can bring parts of the enterprise to a halt. As such, these architectures need to be scalable and highly available, with robust performance guarantees, etc.; the remainder of the argument is well known. What is new, however, is that we are slowly getting more information about the rates at which failures occur in modern architectures, the ripple effects they cause, and the financial consequences of these failures. These numbers are material for a separate posting, but suffice it to say that general service availability is a lot less than 99.9%.

What has changed in the past years is the scope of the problem when dealing with large enterprise applications. A few years ago they were still rather monolithic, or would consist of a few (distributed) components, and availability tracking was relatively simple. Today large enterprise applications consist of hundreds of components executing in parallel, making availability and performance monitoring a very complex activity. A future with web service integration will change the whole scope once again. Available tools are limited in scope and very expensive, and their automatic response handling is very limited.

For the High Performance Transaction Systems workshop, where I was asked to present some work in the context of Continuous Processing Architectures, I will focus on failure detection and performance monitoring. I will revisit some of the techniques we developed in the past that do not work well in production settings, or do not scale to reasonable numbers under production conditions, then turn to the probabilistic and epidemic techniques that are robust and scalable, and present the multi-level failure detection technology that I developed for the Galaxy cluster management system, which targets the management of large, geographically distributed datacenters.
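To give a flavor of the epidemic style of failure detection mentioned above, here is a minimal sketch of a gossip-based heartbeat detector. All class, method, and parameter names here are my own illustrative choices, not taken from Galaxy or any production system: each node keeps a table of heartbeat counters, periodically pushes its table to a randomly chosen peer, merges incoming tables by keeping the highest counter seen per node, and suspects any node whose counter has not advanced within a timeout.

```python
import random
import time


class GossipFailureDetector:
    """Minimal sketch of an epidemic (gossip-based) heartbeat failure detector.

    Each node keeps, per peer, a heartbeat counter and the local time at which
    that counter last advanced. Nodes periodically push their whole table to a
    randomly chosen peer; on receipt, a node keeps the higher counter for each
    entry. A peer whose counter has not advanced within `suspect_timeout`
    seconds is suspected to have failed.
    """

    def __init__(self, node_id, suspect_timeout=10.0, clock=time.monotonic):
        self.node_id = node_id
        self.suspect_timeout = suspect_timeout
        self.clock = clock  # injectable clock, useful for simulation/testing
        # node id -> (heartbeat counter, local time of last advance)
        self.table = {node_id: (0, self.clock())}

    def beat(self):
        """Advance our own heartbeat counter (called on every local tick)."""
        count, _ = self.table[self.node_id]
        self.table[self.node_id] = (count + 1, self.clock())

    def gossip_round(self, peers):
        """One gossip round: push our table to one peer chosen at random."""
        if peers:
            self.gossip_to(random.choice(peers))

    def gossip_to(self, peer):
        """Push our table to a specific peer (epidemic dissemination)."""
        peer.merge(self.table)

    def merge(self, remote_table):
        """Keep the freshest heartbeat counter seen for every node."""
        now = self.clock()
        for node, (count, _) in remote_table.items():
            local = self.table.get(node)
            if local is None or count > local[0]:
                self.table[node] = (count, now)

    def suspects(self):
        """Nodes whose heartbeat has not advanced within the timeout."""
        now = self.clock()
        return [node for node, (_, last) in self.table.items()
                if node != self.node_id and now - last > self.suspect_timeout]
```

The appeal of this scheme is exactly what the workshop talk is about: there is no central monitor, the gossip load per node is constant regardless of system size, and information about a failure spreads to all nodes in a logarithmic number of rounds with high probability.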

Posted by Werner Vogels at October 10, 2003 05:17 PM