Understanding and Maturing the Data-Intensive Scalable Computing Storage Substrate
journal contribution
posted on 1999-06-01, 00:00 authored by Garth Gibson, Bin Fan, Swapnil Patil, Milo Polte, Wittawat Tantisiriroj, Lin Xiao

Modern science has access to, and is more productively
pursued with, massive amounts of data, typically either
gathered from sensors or produced as the output of simulation
or other processing. The table below shows a sampling of
data sets that a few scientists at Carnegie Mellon University
already have available or intend to construct soon. Data Intensive
Scalable Computing (DISC) couples computational resources
with the data storage and access capabilities to
handle massive data science quickly and efficiently. Our
topic in this extended abstract is the effectiveness of the
data-intensive file systems embedded in a DISC system. We are
interested in understanding the differences between data-intensive
file system implementations and high-performance
computing (HPC) parallel file system implementations.
Both are used at comparable scale and speed. Beyond feature
inclusions, which we expect to evolve as data-intensive
file systems see wider use, we find that performance need
not be vastly different. A major source of difference is
their approaches to tolerating data failures: replication
in DISC file systems versus RAID in HPC parallel file systems.
We address the inclusion of RAID in a DISC file
system to dramatically increase the effective capacity available
to users.
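The capacity argument can be made concrete with back-of-envelope arithmetic. The sketch below (not from the abstract; all raw-capacity and group-size parameters are hypothetical examples) compares the usable fraction of raw disk under 3-way replication, the common DISC file system default, against RAID-style parity groups as used in HPC parallel file systems:

```python
def replication_capacity(raw_tb: float, copies: int = 3) -> float:
    """Usable capacity when every block is stored `copies` times."""
    return raw_tb / copies

def raid_capacity(raw_tb: float, data_disks: int, parity_disks: int) -> float:
    """Usable capacity when each group of data_disks data blocks
    carries parity_disks additional parity blocks."""
    group = data_disks + parity_disks
    return raw_tb * data_disks / group

# Hypothetical 1 PB (1000 TB) of raw disk.
raw = 1000.0
rep = replication_capacity(raw, copies=3)              # 3-way replication
r5 = raid_capacity(raw, data_disks=9, parity_disks=1)  # RAID-5-like, 9+1 group
r6 = raid_capacity(raw, data_disks=8, parity_disks=2)  # RAID-6-like, 8+2 group

print(f"3-way replication: {rep:.0f} TB usable ({rep / raw:.0%})")
print(f"RAID-5 (9+1):      {r5:.0f} TB usable ({r5 / raw:.0%})")
print(f"RAID-6 (8+2):      {r6:.0f} TB usable ({r6 / raw:.0%})")
```

Replication yields roughly a third of raw capacity as usable space, while parity-based RAID yields 80 to 90 percent for these group sizes, which is the kind of gap that motivates adding RAID to a DISC file system.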