Posted on 1982-01-01, 00:00. Authored by Bianca Schroeder, Garth Gibson.
With petascale computers only a year or two away, there is a pressing need to anticipate and compensate
for a probable increase in failure and application interruption rates. Researchers, designers and integrators
have available to them far too little detailed information on the failures and interruptions that even smaller
terascale computers experience. The information that is available suggests that application interruptions
will become far more common in the coming decade, and the largest applications may surrender large
fractions of the computer’s resources to taking checkpoints and restarting from a checkpoint after an
interruption. This paper reviews sources of failure information for compute clusters and storage systems,
projects failure rates and the corresponding decrease in application effectiveness, and discusses coping
strategies such as application-level checkpoint compression and system-level process-pairs fault tolerance
for supercomputing. The need for a public repository for detailed failure and interruption records is
particularly acute, as projections from one architectural family of machines to another are widely
disputed. To this end, this paper introduces the Computer Failure Data Repository and issues a call for
failure history data to publish in it.
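
The link between interruption rate and application effectiveness described above can be illustrated with a minimal sketch. It assumes Young's first-order approximation for the checkpoint interval, sqrt(2 * checkpoint time * mean time to interrupt), which this abstract does not itself name. The function names, the restart parameter, and all numeric inputs are hypothetical and are not measurements from any system.

```python
import math


def young_interval(checkpoint_s: float, mtti_s: float) -> float:
    """Approximate checkpoint interval (seconds) under Young's first-order model.

    checkpoint_s: time to write one checkpoint (hypothetical input).
    mtti_s: mean time to interrupt of the application (hypothetical input).
    """
    return math.sqrt(2.0 * checkpoint_s * mtti_s)


def effective_utilization(checkpoint_s: float, mtti_s: float,
                          restart_s: float = 0.0) -> float:
    """First-order estimate of the fraction of time spent on useful work.

    Overhead has two parts: the checkpoint cost paid once per interval, and
    the expected loss per interruption (half an interval of redone work plus
    the restart time). This is an approximation that degrades as the
    checkpoint time approaches the mean time to interrupt.
    """
    tau = young_interval(checkpoint_s, mtti_s)
    checkpoint_overhead = checkpoint_s / tau
    rework_overhead = (tau / 2.0 + restart_s) / mtti_s
    return max(0.0, 1.0 - (checkpoint_overhead + rework_overhead))


if __name__ == "__main__":
    # Hypothetical scenario: 5-minute checkpoints, shrinking time to interrupt.
    for mtti_hours in (8.0, 2.0, 0.5):
        u = effective_utilization(300.0, mtti_hours * 3600.0)
        print(f"MTTI {mtti_hours:4.1f} h -> estimated utilization {u:.2f}")
```

Under these assumptions, holding the checkpoint time fixed while the mean time to interrupt shrinks pushes the estimated utilization down, which is the qualitative trend the abstract anticipates for larger machines.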