Black-Box Problem Localization in Parallel File Systems

2015-12-01T00:00:00Z (GMT) by Michael P. Kasick
Parallel file systems target large, high-performance storage systems. Since these storage systems
comprise a large number of components (e.g., hundreds of file servers and thousands of
disks), they are expected to (and in practice do) frequently exhibit “problems”, from degraded
performance to outright failure of one or more components. The sheer number of components,
and thus, potential problems, makes manual diagnosis of these problems difficult. Of particular
concern are system-wide performance degradations, which may arise from a single misbehaving
component, and thus, pose a challenge for problem localization. Even the failure of a redundant component
with a less significant performance impact is worrisome, as it may, in the absence of explicit
checks, go unnoticed for some time and increase the risk of system unavailability.
As a solution, this thesis defines a novel problem-diagnosis approach, capitalizing upon the
parallel-file-system design criterion of balanced performance, that peer-compares the performance
of system components to localize problems within storage systems running unmodified, “off-the-shelf”
parallel file systems. In support of this thesis, a set of laboratory experiments
demonstrates proof of concept of the peer-comparison approach by injecting four realistic
problems into 12-server, test-bench PVFS and Lustre clusters. This thesis is further validated
by adapting the diagnosis approach to work on a very large production GPFS storage
system consisting of 128 file servers, 32 storage controllers, 1152 disk arrays, and 11,520 total
disks. A 15-month case study presents the problems observed through analysis of 624 GB of
instrumentation data, in which a variety of performance-related storage-system problems are localized
and diagnosed in a matter of hours, as compared to days or longer with manual approaches.
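
The peer-comparison idea described above can be illustrated with a minimal sketch. This is not the thesis's actual algorithm: the abstract does not state which metrics or statistics are used, so the function name `localize_outliers`, the throughput metric, the use of the peer median, and the 50% relative threshold are all assumptions made for illustration. The only premise taken from the text is that servers in a balanced parallel file system should perform alike, so a server that deviates strongly from its peers is a candidate culprit.

```python
from statistics import median


def localize_outliers(metric_by_server, rel_threshold=0.5):
    """Flag servers whose metric deviates from the peer median.

    Hypothetical helper: a server is flagged when its value differs from
    the median of all servers by more than rel_threshold, expressed as a
    fraction of that median. Both the statistic and the threshold are
    illustrative assumptions, not values from the thesis.
    """
    peer_median = median(metric_by_server.values())
    if peer_median == 0:
        # A relative deviation is undefined against a zero median.
        return []
    return sorted(
        server
        for server, value in metric_by_server.items()
        if abs(value - peer_median) / abs(peer_median) > rel_threshold
    )


# Made-up per-server throughput samples (MB/s) for one time window.
throughput_mb_s = {
    "server01": 98.0,
    "server02": 101.0,
    "server03": 35.0,
    "server04": 100.0,
}

suspects = localize_outliers(throughput_mb_s)
print(suspects)
```

With these made-up samples, only the server whose throughput falls far below that of its peers is flagged. The median is used here rather than the mean so that a single misbehaving server does not shift the baseline it is compared against.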