Black-Box Problem Localization in Parallel File Systems. Kasick, Michael P. 2015.

Parallel file systems target large, high-performance storage systems. Since these storage systems are composed of a significant number of components (i.e., hundreds of file servers, thousands of disks, etc.), they are expected to (and in practice do) frequently exhibit “problems”, from degraded performance to outright failure of one or more components. The sheer number of components, and thus of potential problems, makes manual diagnosis of these problems difficult. Of particular concern are system-wide performance degradations, which may arise from a single misbehaving component and thus pose a challenge for problem localization. Even failure of a redundant component with a less significant performance impact is worrisome, as it may, in the absence of explicit checks, go unnoticed for some time and increase the risk of system unavailability.

As a solution, this thesis defines a novel problem-diagnosis approach, capitalizing upon the parallel-file-system design criterion of balanced performance, that peer-compares the performance of system components to localize problems within storage systems running unmodified, “off-the-shelf” parallel file systems. Performed in support of this thesis is a set of laboratory experiments that demonstrate proof-of-concept of the peer-comparison approach by injecting four realistic problems into 12-server, test-bench PVFS and Lustre clusters. The thesis is further validated by adapting the diagnosis approach to work on a very large production GPFS storage system consisting of 128 file servers, 32 storage controllers, 1152 disk arrays, and 11,520 total disks.
A 15-month case study presents the problems observed through analysis of 624 GB of instrumentation data, in which a variety of performance-related storage-system problems are localized and diagnosed in a matter of hours, as compared to days or longer with manual approaches.
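
To make the peer-comparison idea concrete, the following is a minimal illustrative sketch and not the thesis's actual algorithm. It assumes that each server reports one performance metric per time window (for example, request latency), that healthy peers behave alike under balanced load, and that a server is suspect when it deviates from the peer median for several consecutive windows. The function name `localize_outliers` and the parameters `threshold` and `min_windows` are hypothetical, chosen only for this example.

```python
from statistics import median


def localize_outliers(samples, threshold=0.5, min_windows=3):
    """Flag servers that deviate from their peers (illustrative only).

    samples: dict mapping a server name to a list of per-window metric
             values; all lists are assumed to have the same length.
    threshold: hypothetical relative deviation from the peer median
               above which a window counts as anomalous.
    min_windows: hypothetical number of consecutive anomalous windows
                 required before a server is flagged.
    """
    if not samples:
        return []

    servers = list(samples)
    n_windows = len(samples[servers[0]])
    streak = {s: 0 for s in servers}  # consecutive anomalous windows
    flagged = set()

    for w in range(n_windows):
        # The median of all peers serves as the "expected" behavior.
        peer_median = median(samples[s][w] for s in servers)
        scale = abs(peer_median) or 1.0  # avoid division by zero

        for s in servers:
            deviation = abs(samples[s][w] - peer_median) / scale
            if deviation > threshold:
                streak[s] += 1
                if streak[s] >= min_windows:
                    flagged.add(s)
            else:
                streak[s] = 0  # a normal window resets the streak

    return sorted(flagged)
```

Requiring several consecutive anomalous windows is one simple way to avoid flagging a server for a single transient spike; the median is used rather than the mean so that one misbehaving server does not shift the baseline it is compared against.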