On the Reliability of Hard Disks


Today's opening session of FAST has two papers on hard disk reliability. Both present very interesting results that blow away some of the common assumptions in failure modelling of systems.

Bianca Schroeder and Garth Gibson from CMU, in their paper Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?, investigate the failure rates of about 100,000 disks from HPC and Internet sites. There is a range of interesting results in this paper, but the ones I think are most important are in the section on the statistical properties of disk failures. In this section Bianca demonstrates that two common assumptions, that disk failures are independent and that the time between failures (TBF) follows an exponential distribution, are not supported by the collected data. Their data suggest the opposite: disk replacement counts show significant levels of autocorrelation, the TBF distribution shows much higher variability than an exponential distribution, and the expected remaining time until the next disk failure grows with the time that has passed since the last failure.
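To make those last two properties concrete, here is a minimal Python sketch. It compares an exponential TBF model with a Weibull model whose shape parameter is below 1. The Weibull model, the shape value 0.7, the unit scale and all function names are my own illustrative assumptions, not numbers or code taken from the paper. With an exponential distribution the squared coefficient of variation is 1 and the expected remaining time never changes, however long ago the last failure was. With a shape below 1 the variability is higher and the expected remaining time grows with the time since the last failure.

```python
import math


def survival(t, shape, scale=1.0):
    """P(TBF > t) for a Weibull model; shape == 1.0 is the exponential case."""
    return math.exp(-((t / scale) ** shape))


def squared_cv(shape):
    """Variance divided by squared mean for a Weibull; equals 1 when shape == 1."""
    g1 = math.gamma(1.0 + 1.0 / shape)
    g2 = math.gamma(1.0 + 2.0 / shape)
    return g2 / (g1 * g1) - 1.0


def expected_remaining(t, shape, scale=1.0, horizon=400.0, steps=100_000):
    """E[TBF - t | TBF > t] via trapezoidal integration of the survival function.

    horizon is how far past t to integrate; it should be many multiples of
    scale so that the neglected tail is negligible.
    """
    h = horizon / steps
    total = 0.5 * (survival(t, shape, scale) + survival(t + horizon, shape, scale))
    for i in range(1, steps):
        total += survival(t + i * h, shape, scale)
    return total * h / survival(t, shape, scale)


if __name__ == "__main__":
    # 0.7 is an illustrative shape value (an assumption), not a fitted one.
    for shape in (1.0, 0.7):
        print(f"shape={shape}: squared CV = {squared_cv(shape):.2f}")
        for elapsed in (0.0, 1.0, 2.0):
            rem = expected_remaining(elapsed, shape)
            print(f"  time since last failure = {elapsed}: expected remaining = {rem:.2f}")
```

Time here is in arbitrary units (multiples of the scale parameter). The point is only the qualitative contrast between the two models, not a fit to any real failure data.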

Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso from Google, in their paper Failure Trends in a Large Disk Drive Population, have interesting results on the factors that influence disk failures. In their study they found no correlation between disk failure rates and utilization, environmental conditions such as temperature, or age. This means that high disk utilization or the age of a disk has no significant impact on the probability that it will fail. They did find a strong correlation between manufacturer/model and failure rates. Basically, you get what you pay for when it comes to disk reliability. Given that disks generally arrive in large batches, you may want to take care with how you deploy them, so as to reduce the impact of these strong failure correlations.

The only exception to the lack of correlation was that the infant mortality rate for disks showed a correlation with high utilization: if a new disk is really crappy you can detect this by putting a high load on it. This could motivate a longer burn-in period to weed these bad disks out. The paper then goes into an interesting discussion of whether SMART parameters can be used as a predictor of impending disk failure.

Both papers report disk failure rates in the 6%-10% range: in a datacenter with about 100,000 disks you will need to replace between 6,000 and 10,000 disks per year. And these rates will only go up as you try to become more cost effective. The failure rates and the reported failure correlations are very important to take into account when you're building cost effective, reliable storage for your applications.
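The arithmetic is simple enough to sketch in a few lines of Python. The snippet below converts the 1,000,000-hour MTTF from the CMU paper's title into a nominal annual failure rate and compares the resulting replacement count with the 6%-10% range for a 100,000-disk datacenter. The function names are hypothetical, and the conversion assumes a constant failure rate with the usual small-rate approximation (hours per year divided by MTTF).

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def afr_from_mttf(mttf_hours):
    """Nominal annual failure rate implied by an MTTF figure.

    Assumes a constant failure rate and uses the small-rate
    approximation AFR ~= hours per year / MTTF.
    """
    return HOURS_PER_YEAR / mttf_hours


def yearly_replacements(disk_count, annual_rate):
    """Expected number of disks replaced per year, rounded to whole disks."""
    return round(disk_count * annual_rate)


if __name__ == "__main__":
    disks = 100_000
    nominal = afr_from_mttf(1_000_000)
    print(f"nominal AFR from MTTF: {nominal:.2%}")
    print(f"replacements at nominal AFR: {yearly_replacements(disks, nominal)}")
    for rate in (0.06, 0.10):
        print(f"replacements at {rate:.0%}: {yearly_replacements(disks, rate)}")
```

Under these assumptions a 1,000,000-hour MTTF corresponds to a nominal rate of a bit under 1% per year, several times lower than the 6%-10% range above.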

(BTW you’re better off letting somebody else worry about all of this, so store your data in S3 :-))

5 Comments

Brad Murray said:

These aren't the most scientific numbers, but I have a mix of 400 and 500GB Hitachi SATA drives, 148 in total in my datacenter and I have had 1 failure in just over one year. That drive was about 3 weeks old at the time it failed. I also have about 125 Maxtor drives in an EMC Clariion that is 3 years old. I get around 6 failures per year on that box which is close to the numbers you are reporting. Still, I am working on moving the whole thing over to S3 once the kinks get worked out.

Brad Murray said:

In typical fashion, after opening my mouth one of my 500GB SATAs failed while I was asleep last night. Well, I was past the first year already anyway.

Thorsten said:

Thanks for an interesting post and good links! Your last sentence ("you’re better off letting somebody else worry about all of this, so store your data in S3") does beg an obvious question: how does Amazon manage S3 and how are the issues mentioned in the papers overcome?

Ian Ringrose said:

The interesting next question is what can the software do so all the disks don’t fail at the same time?

E.g. should RAID systems keep their spare disks powered down, and only power them up for a short time now and then to test them?

Do we need to stop using RAID and replace it with a system that does not need all the disks to be the same? Then whenever more disks are added to a data center, half of them can be randomly swapped with disks that are already in servers.

Simple admin solutions may also help, e.g. whenever a disk fails, randomly choose another disk of the same type and age and also replace it.

Brad Murray said:

RAID6 is a better solution: you are rarely in critical mode with it because there are two parity blocks per stripe set. My current cards only support RAID5, but I use hot spares which are usually at least 1/3 regenerated by the time I even see the alert for the failure. RAID6 is like RAID5 where the hot spare is always spun up for you.

About this Entry

This page contains a single entry by Werner Vogels published on February 14, 2007 9:42 AM.
