PLFS: A Checkpoint Filesystem for Parallel Applications

Bent, John; Gibson, Garth; Grider, Gary; McClelland, Ben; Nowoczynski, Paul; Nunez, James; Polte, Milo; Wingate, Meghan

doi:10.1184/R1/6608453.v1



journal contribution

posted on 2005-08-01, 00:00 authored by John Bent, Garth Gibson, Gary Grider, Ben McClelland, Paul Nowoczynski, James Nunez, Milo Polte, Meghan Wingate

Parallel applications running across thousands of processors must protect themselves from inevitable system failures. Many applications insulate themselves from failures by checkpointing. For many applications, checkpointing into a shared single file is most convenient. With such an approach, writes are often small and not aligned with file system boundaries. Unfortunately for these applications, this preferred data layout results in pathologically poor performance from the underlying file system, which is optimized for large, aligned writes to non-shared files. To address this fundamental mismatch, we have developed a virtual parallel log-structured file system, PLFS. PLFS remaps an application’s preferred data layout into one that is optimized for the underlying file system. Through testing on PanFS, Lustre, and GPFS, we have seen that this layer of indirection and reorganization can reduce checkpoint time by an order of magnitude for several important benchmarks and real applications without any application modification.
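
The remapping idea in the abstract can be illustrated with a small in-memory sketch. This is not the PLFS implementation. It only assumes the general log-structured technique: each writer appends its bytes to its own private log, and an index records where each piece belongs in the shared logical file. All names here (LogStructuredFile, WriterLog, IndexEntry) are hypothetical, and the global sequence counter used to order overlapping writes is an assumption made to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    logical_offset: int   # position in the shared logical file
    length: int           # number of bytes written
    physical_offset: int  # position in the writer's own log
    seq: int              # write order (assumption: one global counter)


@dataclass
class WriterLog:
    data: bytearray = field(default_factory=bytearray)
    index: list = field(default_factory=list)


class LogStructuredFile:
    """Hypothetical sketch: one logical file backed by per-writer logs."""

    def __init__(self):
        self.logs = {}   # writer id -> WriterLog
        self._seq = 0

    def write(self, writer_id, logical_offset, payload):
        log = self.logs.setdefault(writer_id, WriterLog())
        entry = IndexEntry(logical_offset, len(payload), len(log.data), self._seq)
        self._seq += 1
        log.data.extend(payload)  # always a sequential append to a non-shared log
        log.index.append(entry)

    def read(self, offset, length):
        out = bytearray(length)   # unwritten holes read back as zero bytes
        entries = [(e, log) for log in self.logs.values() for e in log.index]
        # Apply entries in write order so later writes win on overlap.
        for e, log in sorted(entries, key=lambda pair: pair[0].seq):
            start = max(offset, e.logical_offset)
            end = min(offset + length, e.logical_offset + e.length)
            if start < end:
                src = e.physical_offset + (start - e.logical_offset)
                out[start - offset:end - offset] = log.data[src:src + (end - start)]
        return bytes(out)


# Two writers issue small, interleaved (strided) writes to one logical file.
f = LogStructuredFile()
f.write(0, 0, b"aaa")
f.write(1, 3, b"bbb")
f.write(0, 6, b"AAA")
f.write(1, 9, b"BBB")
logical_view = f.read(0, 12)
```

In this sketch the application still sees a single shared file with its preferred strided layout, while each backing log only ever receives sequential appends from one writer, which is the access pattern the abstract says the underlying file systems handle well.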

History

Date

2005-08-01


Keywords

High performance computing; parallel computing; checkpointing; parallel file systems and IO

Licence

In Copyright

