Today in the opening session of FAST there are two papers on hard disk reliability. Both present very interesting results that blow away some common assumptions in the failure modelling of systems.
Bianca Schroeder and Garth Gibson from CMU, in their paper Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?, investigate the failure rates of about 100,000 disks from HPC and Internet sites. There is a range of interesting results in this paper, but the ones I think are most important are in the section on the statistical properties of disk failures. There the authors demonstrate that two common assumptions, namely that disk failures are independent and that the time between failures follows an exponential distribution, are not supported by the collected data. Their data suggest the opposite: disk replacement counts show significant levels of autocorrelation, the time-between-failures distribution shows much higher variability than an exponential distribution, and the expected remaining time until the next disk failure grows with the time since the last failure.
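That last property is exactly what an exponential model cannot capture: the exponential distribution is memoryless, so the expected remaining time to the next failure never changes no matter how long you have waited. A minimal simulation sketch (my own illustration, not from the paper; the mean of 10 time units and the Weibull shape of 0.7 are arbitrary assumptions) shows the contrast with a heavy-tailed distribution:

```python
import random

random.seed(0)
MEAN = 10.0  # assumed mean time between failures, arbitrary units

def mean_remaining(samples, t):
    """Mean remaining time until failure, given survival past time t."""
    survivors = [s - t for s in samples if s > t]
    return sum(survivors) / len(survivors)

# Exponential TBF: memoryless, so the expected remaining time
# stays ~MEAN no matter how long it has been since the last failure.
exp_samples = [random.expovariate(1 / MEAN) for _ in range(200_000)]

# Heavy-tailed TBF (Weibull with shape < 1): the expected remaining
# time GROWS with the time since the last failure, matching the
# behaviour Schroeder and Gibson report in their field data.
weib_samples = [random.weibullvariate(MEAN, 0.7) for _ in range(200_000)]

for t in (0, 10, 20):
    print(f"t={t:2d}  exponential: {mean_remaining(exp_samples, t):5.1f}  "
          f"heavy-tailed: {mean_remaining(weib_samples, t):5.1f}")
```

Running this, the exponential column stays flat at roughly the mean, while the heavy-tailed column climbs, which is why assuming exponential TBF makes you systematically too optimistic right after a failure.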
Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso from Google, in their paper Failure Trends in a Large Disk Drive Population, have interesting results with respect to the factors that influence disk failures. In their study they found no correlation between disk failure rates and utilization, environmental conditions such as temperature, or age. This means that high utilization or the age of a disk has no significant impact on the probability that it will fail. They did find a strong correlation between manufacturer/model and failure rates: basically, you get what you pay for when it comes to disk reliability. Given that disks generally arrive in large batches, you may want to take care with how you deploy them, to reduce the impact of these strong failure correlations.
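One way to act on that advice is to spread disks from the same delivery batch across independent redundancy groups, so that a bad batch does not concentrate its correlated failures inside a single RAID group. A hypothetical round-robin sketch (the batch and group names are made up for illustration):

```python
from itertools import cycle

# Hypothetical inventory: disks keyed by the delivery batch they arrived in.
batches = {
    "batch_A": ["A1", "A2", "A3", "A4"],
    "batch_B": ["B1", "B2", "B3", "B4"],
    "batch_C": ["C1", "C2", "C3", "C4"],
}

# Round-robin each batch's disks across the RAID groups, so no group
# ends up with more than one disk from any single batch.
num_groups = 4
groups = [[] for _ in range(num_groups)]
slots = cycle(range(num_groups))

for batch in batches.values():
    for disk in batch:
        groups[next(slots)].append(disk)

for i, g in enumerate(groups):
    print(f"group {i}: {g}")
```

With this placement a batch-wide defect costs each group at most one disk, which is the failure mode RAID-style redundancy is actually designed to absorb.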
The only exception to the lack of correlation was that infant mortality rate for disks showed a correlation with high utilization: if a new disk is really crappy you can detect this by putting a high load on it. This could motivate a longer burn-in period to weed these bad disks out. The paper then goes into an interesting discussion of whether SMART parameters can be used as a predictor of impeding disk failure.
Both papers report disk replacement rates in the 6%-10% range: in a datacenter with about 100,000 disks, you will need to replace between 6,000 and 10,000 disks per year. And these rates will only go up as you push to become more cost effective. The failure rates and the reported failure correlations are very important to take into account when you're building cost-effective, reliable storage for your applications.
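To make that concrete, a back-of-the-envelope calculation (fleet size and rates taken straight from the numbers above) shows what those percentages mean for day-to-day operations:

```python
# Annual disk replacements for a fleet, using the 6%-10% rates
# reported in the two papers and a 100,000-disk fleet.
fleet_size = 100_000

for rate in (0.06, 0.10):
    per_year = fleet_size * rate
    print(f"{rate:.0%} annual rate: {per_year:,.0f} disks/year "
          f"(~{per_year / 365:.0f} replacements/day)")
```

Even at the low end that is a steady stream of roughly 16 disk swaps every single day, which is why replacement has to be treated as routine operations rather than an exceptional event.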
(BTW you’re better off letting somebody else worry about all of this, so store your data in S3 :-))