February 2007 Archives

On the Reliability of Hard Disks

Today's opening session of FAST features two papers that study hard disk reliability. Both present very interesting results that blow away some of the common assumptions in failure modelling of systems.

Bianca Schroeder and Garth Gibson from CMU, in their paper Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?, investigate the failure rates of about 100,000 disks from HPC and Internet sites. There is a range of interesting results in this paper, but the ones I think are most important are in the section on the statistical properties of disk failures. In this section Bianca demonstrates that two common assumptions, that disk failures are independent and that the time between failures (TBF) follows an exponential distribution, are not supported by the collected data. Their data suggest the opposite: disk replacement counts show significant levels of autocorrelation, the TBF distribution shows much higher variability than an exponential distribution, and the expected remaining time until the next disk failure grows with the time that has passed since the last failure.
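The two checks described above can be sketched in a few lines of Python. This is only an illustration, not the method used in the paper: the function names and the synthetic data are my own assumptions. It relies on two standard facts: an exponential distribution has a coefficient of variation (standard deviation divided by mean) of 1, and independent counts have a lag-1 autocorrelation close to 0.

```python
import random
import statistics


def coefficient_of_variation(samples):
    """Standard deviation divided by the mean (assumes a non-zero mean).

    For an exponential distribution this ratio is 1, so a clearly larger
    value means more variability than the exponential model predicts.
    """
    return statistics.pstdev(samples) / statistics.fmean(samples)


def autocorrelation(series, lag=1):
    """Sample autocorrelation at the given lag (assumes a non-constant series).

    Values near 0 are consistent with independence; values well above 0
    mean that busy periods tend to follow busy periods.
    """
    mean = statistics.fmean(series)
    variance_sum = sum((x - mean) ** 2 for x in series)
    covariance_sum = sum(
        (series[i] - mean) * (series[i + lag] - mean)
        for i in range(len(series) - lag)
    )
    return covariance_sum / variance_sum


# Synthetic, made-up data purely for illustration (not from either paper):
# times between failures drawn from an exponential distribution.
rng = random.Random(0)
synthetic_tbf = [rng.expovariate(1.0) for _ in range(10_000)]
print("CV of synthetic exponential TBF:", coefficient_of_variation(synthetic_tbf))

# Hypothetical weekly replacement counts with an obvious trend.
weekly_replacements = [1, 2, 3, 4, 5, 6, 7, 8]
print("lag-1 autocorrelation:", autocorrelation(weekly_replacements))
```

Running the same two functions on real replacement logs would show whether the exponential and independence assumptions hold for a given installation.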

Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso from Google, in their paper Failure Trends in a Large Disk Drive Population, have interesting results with respect to the factors that influence disk failures. In their study they found that there was no correlation between disk failure rates and utilization, environmental conditions such as temperature, or age. This means that high disk utilization or the age of the disk has no significant impact on the probability that it will fail. They did find a strong correlation between manufacturer/model and failure rates. Basically, you get what you pay for when it comes to disk reliability. Given that disks generally arrive in large batches, you may want to take care with how you deploy them so that you reduce the impact of these strong failure correlations.

The only exception to the lack of correlation was that the infant mortality rate for disks showed a correlation with high utilization: if a new disk is really crappy you can detect this by putting a high load on it. This could motivate a longer burn-in period to weed these bad disks out. The paper then goes into an interesting discussion of whether SMART parameters can be used as a predictor of impending disk failure.

Both papers report disk failure rates in the 6%-10% range: in a datacenter with about 100,000 disks you will need to replace between 6,000 and 10,000 disks per year. And these rates will only go up as you try to become more cost-effective. The failure rates and the reported failure correlations are very important to take into account when you're building cost-effective, reliable storage for your applications.
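The arithmetic behind these numbers is simple enough to sketch. The snippet below is a minimal illustration, with function names of my own choosing: it converts an MTTF figure into a nominal annual failure rate using the common approximation of hours per year divided by MTTF, and it computes the expected yearly replacements for the 100,000-disk example above.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def nominal_annual_failure_rate(mttf_hours):
    """Approximate annual failure rate implied by an MTTF figure.

    Uses the simple approximation hours_per_year / MTTF, which is
    reasonable as long as the MTTF is much larger than one year.
    """
    return HOURS_PER_YEAR / mttf_hours


def expected_replacements(disk_count, annual_failure_rate):
    """Expected number of disks to replace per year."""
    return round(disk_count * annual_failure_rate)


# An MTTF of 1,000,000 hours implies a rate of a little under 1% per year.
print(f"{nominal_annual_failure_rate(1_000_000):.2%}")

# The 6%-10% range quoted above, applied to 100,000 disks.
print(expected_replacements(100_000, 0.06), expected_replacements(100_000, 0.10))
```

The gap between the rate implied by the MTTF and the observed range is the point of the title of the CMU paper.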

(BTW you’re better off letting somebody else worry about all of this, so store your data in S3 :-))

The Conference Season is Opening Up Again

I have had the luxury of almost 4 months without any real conferences. Don’t get me wrong: it is not that I do not enjoy the public speaking side of my job; it is just that I always experience it as rather disruptive. It is great to focus for a while and get things done. And the holiday shopping season is a good excuse to stay home at any e-commerce organization.

I did do a little outing to Washington D.C. to talk at a meeting at the National Academies. A study is being done into very large scale systems development and I was invited to give Amazon’s view on this. It was a very interesting workshop where I was really interested in what the other participants would bring to the table. It is amazing to see how many organizations still believe that they can do a massive scale-out in process and technology while still maintaining top-down control. They do not seem to understand how unnatural this strict control is: the only way to achieve it is by resorting to tricks that limit the events that can happen, but in any really large system you do not have the luxury of predicting and controlling all the events that can happen. Control is an illusion that might work in small systems, but in Real Life you cannot maintain this illusion. A nice book to trigger a discussion on these topics is “Creation: Life and How to Make It” by Steve Grand, the inventor of Creatures.

A few talks looked forward in a more innovative manner; in particular John Vu, one of the Chief Engineers at Boeing, gave an interesting presentation. He also presented a more directional view of where large scale systems need to go, and his views are not that different from Amazon’s in that self-organizing techniques are crucial for the future of system development.

But now comes a month in which I’ll be on the road again. Next week I’ll be in London for the Future of Web Apps conference to talk about Amazon’s Web Scale Computing Platform and the Web Apps that are enabled by combining several of these services.

I will be back in London on March 14 for two presentations at QCon. This is a conference for enterprise architects organized by the same folks who organize JAOO. I had such a great time at JAOO last year that I let myself be tricked into giving two presentations: one opening keynote on the Amazon Technology Platform, and one presentation on Availability & Consistency where I talk about the different trade-offs you have to make in state management when building very large scale distributed applications.

Two weeks after that I will be back in the US to talk at ETech about Web Scale Computing architectures.

The Search for Jim Gray Continues

As Mike Olson just wrote on the Tenacious Search weblog, today was the first day that action could be taken on the boats found in the satellite and ER-2 streams. Bad weather had kept any aerial search parties on the ground until now. This morning two planes were dispatched to locations derived from image coordinates combined with drift models. Those planes returned this afternoon without having spotted anything, indicating that the boats in the images were most likely not the Tenacious. For more details read Mike's report.

The search continues.

Half a Million Assignments Completed

Over 530,000 Mechanical Turk assignments have been completed by more than 12,000 volunteers in the search for Jim Gray. We need a little more of a push and then all the images will have been processed. A team of experts led by Alex Szalay of Johns Hopkins University has been working through the thousands of images marked for further investigation. They currently have a set of about 20 images that are being scrutinized further before they are handed to the Coast Guard to determine whether action can be taken on them.

Scientists from JPL and CCMOP have developed drift models for the area so that the Coast Guard can use that information to determine where objects seen on the satellite images may have drifted to in the past days.

Jim Gray with his colleagues Gianfranco Putzolu and Irving Traiger in the late '70s / early '80s, when they did groundbreaking work on concurrency control for databases (image courtesy of Heather Gray)

Update: After 560,000 assignments all images have been reviewed and this Mechanical Turk HIT group is all out of work. This was a tremendous effort by many, many volunteers. The experts continue to review the results to narrow down the possibilities.

Turkers Working Hard on the Search for Jim Gray

It is now 3 PM on Sunday afternoon and the group of volunteers in the search for Jim Gray has worked its way through almost 100,000 assignments since Friday 5 PM. Since then we have seen over 6,000 individual workers completing anywhere from 1 to almost 1,000 assignments. And there are still more to go. Over 2,000 images were marked for further inspection.

We have set up a second Mechanical Turk HIT process that is being used by satellite image inspection experts, who are taking the marked images and correlating them with other data to determine whether this information should be forwarded to the Coast Guard for action.

The volunteers at Johns Hopkins are processing the high resolution imagery that became available later and that is now ready to be used in the review process.

High Altitude Search for Jim Gray

We have now added the data captured by the NASA ER-2 plane yesterday over the ocean area outside of San Francisco. We were very fortunate that this flight was scheduled for yesterday and that the NASA folks were interested in having it capture these images. We have been able to split them just like yesterday’s satellite images and create HITs (Human Intelligence Tasks) from them.

The images are in this Mechanical Turk HIT Group.

The ER-2 is a high-altitude aircraft that replaced the famous U-2 aircraft of the Cold War. For this run it was equipped with near-infrared cameras that make it extremely suitable for finding man-made reflective surfaces in a natural environment. Land will show up as red and sea as dark blue. Any foreign object in the sea would be bright, close to white.

PS: an earlier separate group we created will be phased out; it was too confusing.

Help Find Jim Gray

Computer science icon Jim Gray mysteriously disappeared after a solo trip with his sailboat outside San Francisco Bay. The Coast Guard has been searching for 4 days but has not been able to locate anything, not even debris. On Thursday 3 private planes searched the coastal areas, and they also returned unsuccessful.

Through a major effort by many people we were able to have the Digital Globe satellite make a run over the area on Thursday morning and have the data made publicly available. We have split these images into smaller tiles that can be easily scanned visually and stored them in the Amazon S3 storage service. We then created tasks for reviewing these images and loaded them into the Amazon Mechanical Turk Service.
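For readers curious about the tiling step, here is a rough sketch of how a large image can be cut into review-sized tiles. It is not the code we ran: the tile size, the key naming scheme and the function names are assumptions made for illustration, and the actual cropping and uploading to S3 are left out.

```python
def tile_boxes(width, height, tile_size):
    """Yield (left, top, right, bottom) pixel boxes that cover the image.

    Tiles on the right and bottom edges are clipped to the image bounds,
    so together the boxes cover every pixel exactly once.
    """
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            right = min(left + tile_size, width)
            bottom = min(top + tile_size, height)
            yield (left, top, right, bottom)


def tile_key(image_name, box):
    """Hypothetical storage key for one tile, built from its top-left corner."""
    left, top, _, _ = box
    return f"{image_name}/tile_{top:06d}_{left:06d}.jpg"


# Example with made-up dimensions: a 5 x 3 pixel image cut into 2 x 2 tiles.
for box in tile_boxes(5, 3, 2):
    print(tile_key("example-scene", box), box)
```

Each box could then be cropped with an imaging library, stored under its key, and referenced from one review task.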

This is where you come in. We need your help in reviewing these images to see whether you can locate Jim’s boat in any of these images. Please go to the Amazon Mechanical Turk site and help us find Jim Gray.

The weather conditions were not ideal as some areas were cloudy, but we can still look for him in those places where there is a somewhat clear view. We hope to get more satellite data of a wider area in the coming days. The current images are panchromatic with a 0.82 m resolution, and Jim’s boat would be about 6 pixels in size. Please visit the Amazon Mechanical Turk site for more details.

I have to stress that many individuals and companies are to thank for making this possible: many academic friends worked relentlessly around the clock to get access to the data, many industry friends of Jim functioned as connectors to hook up officials and individuals, and people from NASA, Digital Globe, Microsoft, Google, Oracle, Amazon and others worked hard to get the data collected and made available on a very short time scale. The Mechanical Turk team worked deep into the night to make this work.

Now it is your turn: go find Jim Gray.

Update 2/3: We have now added the data from yesterday’s NASA ER-2 flight to the system. They are in the same Mechanical Turk Group as the satellite image search. Please follow this link to work on these. More details here.

Update 2/4: Over 100,000 assignments have been completed, and new high resolution images are coming online. More details here.

About this Archive

This page is an archive of entries from February 2007 listed from newest to oldest.
