Black-Box Problem Localization in Parallel File Systems

Kasick, Michael P.

doi:10.1184/R1/7415978.v1

mkasick_ECE_2015.pdf (2.2 MB)

thesis

posted on 2015-12-01, 00:00 authored by Michael P. Kasick

Parallel file systems target large, high-performance storage systems. Since these storage systems
are comprised of a significant number of components (e.g., hundreds of file servers and thousands of
disks), they are expected to (and in practice do) frequently exhibit “problems”, from degraded
performance to outright failure of one or more components. The sheer number of components,
and thus, potential problems, makes manual diagnosis of these problems difficult. Of particular
concern are system-wide performance degradations, which may arise from a single misbehaving
component, and thus, pose a challenge for problem localization. Even failure of a redundant component
with a less-significant performance impact is worrisome as it may, in the absence of explicit
checks, go unnoticed for some time and increase the risk of system unavailability.

As a solution, this thesis defines a novel problem-diagnosis approach, capitalizing upon the
parallel-file-system design criterion of balanced performance, that peer-compares the performance
of system components to localize problems within storage systems running unmodified, “off-the-shelf”
parallel file systems. Performed in support of this thesis is a set of laboratory experiments
that demonstrate proof-of-concept of the peer-comparison approach by injecting four realistic
problems into 12-server, test-bench PVFS and Lustre clusters. This thesis is further validated
by taking the diagnosis approach and adapting it to work on a very large, production GPFS storage
system consisting of 128 file servers, 32 storage controllers, 1152 disk arrays, and 11,520 total
disks. A 15-month case study presents the problems observed through analysis of 624 GB of
instrumentation data, in which a variety of performance-related storage-system problems are localized
and diagnosed in a matter of hours, as compared to days or longer with manual approaches.
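
The peer-comparison idea described in the abstract can be illustrated with a minimal sketch. It assumes that, under balanced performance, every server should report a similar value for some metric, so a server that deviates strongly from the median of its peers is a candidate culprit. The function name, the latency metric, the sample values, and the median-absolute-deviation threshold are all hypothetical choices made for illustration; they are not taken from the thesis, whose actual analysis may differ.

```python
from statistics import median


def localize_outliers(metrics, threshold=3.0):
    """Return the names of components whose metric deviates from the peer median.

    metrics:   dict mapping component name -> observed value
               (e.g., mean I/O latency in ms; an assumed metric).
    threshold: hypothetical cutoff, in units of the median absolute
               deviation (MAD) across all peers.
    """
    values = list(metrics.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Most peers agree exactly, so flag any component that differs at all.
        return sorted(name for name, v in metrics.items() if v != med)
    return sorted(
        name for name, v in metrics.items() if abs(v - med) / mad > threshold
    )


# Hypothetical per-server latencies; server04 is the misbehaving peer.
latency_ms = {
    "server01": 10.2,
    "server02": 9.8,
    "server03": 10.1,
    "server04": 31.5,
    "server05": 10.0,
    "server06": 9.9,
}
suspects = localize_outliers(latency_ms)
print(suspects)
```

The median is used here rather than the mean so that a single faulty server does not shift the baseline it is compared against. No knowledge of the file system's internals is required, only per-component measurements.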

History

Date

2015-12-01

Degree Type

Dissertation

Department

Electrical and Computer Engineering

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Priya Narasimhan

Keywords

Parallel File Systems

Licence

In Copyright
