Root Cause

For those of you interested in the details of last Sunday's Amazon S3 availability issue, you should read the detailed explanation posted at the AWS Status Dashboard. The root cause was a single-bit corruption of internal state messages that are distributed via gossip techniques.

1 TrackBacks

Listed below are links to blogs that reference this entry: Root Cause.

» Amazon’s recent S3 problems from Savas Weblog

A great read for those, like me, who are into large-scale systems. See what happened in S3’s recent ...

9 Comments

JungleDave said:

Interesting - I saw this same type of single-bit corruption a few months ago on S3, and posted about it here:
http://developer.amazonwebservices.com/connect/thread.jspa?messageID=86214

Now obviously Sunday's problem was with internal communication rather than server->client downloads, but it seems like the exact same type of error. I wonder if it could have even been the same bad component that introduced the corruption.

Werner Vogels said:

Thanks Dave, I don't know the answer to whether there is a correlation between the two events, but I forwarded your note to the team.

Eelco said:

I hope this explanation will get the same exposure as everybody complaining when S3 was down. It's good to read that a system weakness was identified and solved. In my opinion all systems (but especially those that grow like S3) will experience growth problems at certain stages. What matters is how (fast) the problem is solved. For us S3 is still production-grade reliable.

As a passionate EC2/S3 user and Amazon's staunch defender, it's always exhausting for me to deal with these kinds of outages (two until now): I have to temporarily solve the problem while convincing everyone around me that Amazon continues to be the best choice. Even new potential users may become a little reluctant to migrate their "things" to Amazon.

I truly hope outages of this magnitude will stop happening. I know bad things may always occur, but a 6-hour wait on a Sunday (or on any other day) is not admissible, as Amazon's team has also stated.

A nice surprise to counteract the consequences of this event would be the availability of European EC2 or even EC2 persistent storage.

Nick Gerner said:

Thanks for being so transparent about the issues involved. I'm glad that you're doing more than just solving the specific problem and are addressing the meta-issues involved.

Matt Nourse said:

Thanks for the super-interesting post. I found the timeline & postmortem especially helpful. I see that single-bit corruption was the cause of the problem. Can you say what caused that corruption (e.g. bad RAM, disk failure, network corruption, ...)?

LM said:

I commend the S3 team for their response on this. It's refreshing to see a professional response.

Is it fair to categorize this as a fault in the failure detector? I read the explanation as if the failure detection mechanism propagated incorrect information, but I don't know if that's a fair assessment.

Andrey Kuzmin said:

Great analysis. I won't be surprised if this (eventually :)) enters CS textbooks on distributed systems as an important case study.

As to single-bit errors, it looks like the gossip-specific impact has been overlooked (the probability of a single-bit error is low, but the epidemic nature of the protocol makes the impact high enough to necessitate error detection/correction). One minor tech suggestion here: MD5 is still heavy, while CRC32(C) with SSE4 becomes a breeze (and should pretty much suffice as a gossip message digest).
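
To make the checksum suggestion concrete, here is a minimal sketch, not a description of how S3 actually frames its gossip messages. It assumes a message is just a byte string with a 4-byte digest appended, and the helper names (wrap, unwrap, flip_bit) are hypothetical. Note that Python's zlib.crc32 computes the standard CRC-32, not the CRC32C variant that the SSE4 instruction accelerates; the detection idea is the same.

```python
import zlib
from typing import Optional


def wrap(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 of the payload (assumed framing)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unwrap(message: bytes) -> Optional[bytes]:
    """Return the payload if its checksum matches, otherwise None."""
    if len(message) < 4:
        return None  # too short to carry a digest
    payload, digest = message[:-4], message[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != digest:
        return None  # corrupted: drop it instead of gossiping it onward
    return payload


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit corruption at the given bit position."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    message = wrap(b"server-42:healthy")  # hypothetical state message
    print(unwrap(message) is not None)
    print(unwrap(flip_bit(message, 13)) is None)
```

A receiver that gets None from unwrap would discard the message rather than forward it, so a flipped bit stops at the first hop instead of spreading epidemically.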

Matt Johnson said:

"Root cause was single bit corruption of internal state messages that are distributed via Gossip techniques."

This just goes to show that bad gossip can be really damaging. :-)

Thanks for the update and details.

About this Entry

This page contains a single entry by Werner Vogels published on July 25, 2008 5:51 PM.
