Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? (CMU-PDL-06-111)

Schroeder, Bianca; Gibson, Garth A.

doi:10.1184/R1/6619535.v1

journal contribution

posted on 2006-09-01, 00:00 authored by Bianca Schroeder, Garth A. Gibson

Component failure in large-scale IT installations such as cluster supercomputers or internet service providers is becoming an ever larger problem as the number of processors, memory chips and disks in a single cluster approaches a million. In this paper, we present and analyze field-gathered disk replacement data from five systems in production use at three organizations: two supercomputing sites and one internet service provider. About 70,000 disks are covered by this data, some for an entire lifetime of 5 years. All disks were high-performance enterprise disks (SCSI or FC), whose datasheet MTTF of 1,200,000 hours suggests a nominal annual failure rate of at most 0.75%. We find that in the field, annual disk replacement rates exceed 1%, with 2-4% common and up to 12% observed on some systems. This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF, and that it can be quite variable from installation to installation. We also find evidence that failure rate is not constant with age, and that rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation. That is, replacement rates in our data grew constantly with age, an effect often assumed not to set in until after 5 years of use. In our statistical analysis of the data, we find that time between failures is not well modeled by an exponential distribution, since the empirical distribution exhibits higher levels of variability and decreasing hazard rates. We also find significant levels of correlation between failures, including autocorrelation and long-range dependence.
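The arithmetic behind the datasheet figure in the abstract can be sketched as follows. This is a minimal illustration, not code from the report: it assumes disks are powered on for all 8,760 hours of a year, and the function names (`nominal_afr`, `exponential_afr`, `implied_mttf_hours`) are hypothetical helpers introduced here.

```python
import math

# Assumption: disks run continuously, so one year is 24 * 365 = 8760 power-on hours.
HOURS_PER_YEAR = 24 * 365


def nominal_afr(mttf_hours: float) -> float:
    """Approximate annual failure rate implied by a datasheet MTTF."""
    return HOURS_PER_YEAR / mttf_hours


def exponential_afr(mttf_hours: float) -> float:
    """Probability of failing within one year if lifetimes were exponential
    with mean mttf_hours (the model the abstract reports fits poorly)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)


def implied_mttf_hours(annual_rate: float) -> float:
    """Inverse of nominal_afr: the MTTF that a given annual rate would imply."""
    return HOURS_PER_YEAR / annual_rate


datasheet_mttf = 1_200_000  # hours, as quoted in the abstract
print(f"nominal AFR:     {nominal_afr(datasheet_mttf):.2%}")      # 0.73%
print(f"exponential AFR: {exponential_afr(datasheet_mttf):.2%}")  # 0.73%

# Field replacement rates quoted in the abstract, expressed as an implied MTTF.
for rate in (0.02, 0.04):
    print(f"{rate:.0%} per year -> about {implied_mttf_hours(rate):,.0f} hours")
```

Both estimates fall under the 0.75% bound stated in the abstract. Under the same simplification, the commonly observed 2-4% replacement rates correspond to roughly 219,000 to 438,000 hours. The report measures replacements rather than vendor-confirmed failures, so this conversion is only a rough comparison.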

History

Date

2006-09-01

Keywords

Disk failure data; failure rate; lifetime data; disk reliability; mean time to failure (MTTF); annualized failure rate (AFR)

Licence

In Copyright
