Understanding Failures in Petascale Computers

Schroeder, Bianca; Gibson, Garth

doi:10.1184/R1/6611285.v1

journal contribution

posted on 1982-01-01, 00:00 authored by Bianca Schroeder, Garth Gibson

With petascale computers only a year or two away, there is a pressing need to anticipate and compensate for a probable increase in failure and application interruption rates. Researchers, designers, and integrators have far too little detailed information on the failures and interruptions that even smaller terascale computers experience. The information that is available suggests that application interruptions will become far more common in the coming decade, and that the largest applications may surrender large fractions of the computer’s resources to taking checkpoints and restarting from a checkpoint after an interruption. This paper reviews sources of failure information for compute clusters and storage systems, projects failure rates and the corresponding decrease in application effectiveness, and discusses coping strategies such as application-level checkpoint compression and system-level process-pairs fault tolerance for supercomputing. The lack of a public repository for detailed failure and interruption records is particularly concerning, as projections from one architectural family of machines to another are widely disputed. To this end, this paper introduces the Computer Failure Data Repository and issues a call for failure history data to publish in it.

Keywords

computer sciences

Licence

In Copyright
