Disk-Adaptive Redundancy: Tailoring Data Redundancy to Disk-Reliability Heterogeneity in Cluster Storage Systems

Kadekodi, Saurabh

doi:10.1184/R1/14461768.v1

CMU-CS-20-142.pdf (3.38 MB)

thesis

posted on 2021-04-23, 20:18, authored by Saurabh Kadekodi

Large-scale cluster storage systems contain hundreds of thousands of hard disk drives in their primary storage tier. Since the clusters are not built all at once, there is significant heterogeneity among the disks in terms of their capacity, make/model, firmware, etc. Redundancy settings for data reliability are generally configured in a “one-scheme-fits-all” manner, assuming that this heterogeneous disk population has homogeneous reliability characteristics. In reality, we observe that different disk groups fail differently, giving clusters significant disk-reliability heterogeneity. This dissertation paves the way for exploiting disk-reliability heterogeneity to tailor redundancy settings to different disk groups for cost-effective, and arguably safer, redundancy in large-scale cluster storage systems.

Our first contribution is an in-depth, data-driven analysis of the reliability of over 5.3 million disks across over 60 makes/models in three large production environments (Google, NetApp and Backblaze). We observe that the strongest disks can be over an order of magnitude more reliable than the weakest disks in the same storage cluster. This makes today’s static selection of redundancy schemes either insufficient, wasteful, or both. We identify and quantify the opportunity of achieving lower storage cost along with increased data protection by means of disk-adaptive redundancy.

Our next contribution is the design of the heterogeneity-aware redundancy tuner (HeART), an online tuning tool that guides the selection of different redundancy settings for long-term data reliability, based on the observed reliability properties of each disk group. By processing disk failure data over time, HeART identifies the boundaries and steady-state failure rate for each deployed disk group by make/model. Using this information, HeART suggests the most space-efficient redundancy option allowed that will achieve the specified target data reliability. HeART is evaluated using longitudinal disk failure logs from a large production cluster with over 100K disks. Guided by HeART, the cluster could meet target data reliability levels with far fewer disks than one-scheme-for-all approaches: 11–16% fewer compared to erasure codes like 10-of-14 or 6-of-9, and up to 33% fewer compared to 3-way replication.

While HeART promises substantial space-savings, it is rendered unusable in production settings of real-world clusters, because the IO load of transitions between redundancy schemes overwhelms the storage infrastructure (termed transition overload). Analysis of Google’s cluster traces shows transition overload consuming 100% of the cluster IO bandwidth for weeks at a time, making transition overload a show-stopper for practical disk-adaptive redundancy.

Building on the insights drawn from our data-driven analysis, the next contribution of this dissertation is Pacemaker, a low-overhead disk-adaptive redundancy orchestrator that realizes HeART’s dream in practice. Pacemaker mitigates transition overload by (1) proactively organizing data layouts to make future transitions efficient, (2) initiating transitions proactively in a manner that avoids urgency while not compromising on space-savings, and (3) designing more IO-efficient redundancy transitioning mechanisms. Evaluation of Pacemaker with traces from four large (110K–450K disks) production clusters (three from Google and one from Backblaze) shows that transition IO never needs more than 5% of cluster IO bandwidth (only 0.2–0.4% on average). Pacemaker achieves this while providing overall space-savings of 14–20% (compared to using a static 6-of-9 scheme) and never leaving data under-protected.

The final contribution of this dissertation is the design and implementation of disk-adaptive redundancy techniques from Pacemaker in the widely used Hadoop Distributed File System (HDFS). This prototype re-purposes HDFS’s existing architectural components for disk-adaptive redundancy, and successfully leverages the robustness and maturity of the existing code. Moreover, the components that are re-purposed are fundamental to any distributed storage system’s architecture, and thus, this prototype also serves as a guideline for future systems that wish to support disk-adaptive redundancy.
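
As a rough illustration of the selection step the abstract attributes to HeART (picking the most space-efficient allowed scheme that still meets a target reliability for a disk group), the following Python sketch may help. It is not the dissertation's method: the independent-failure binomial model, the one-day repair window, the function names, and the inclusion of a wide 30-of-33 scheme are all assumptions made for illustration. Only the 10-of-14, 6-of-9 and 3-way replication schemes are named in the abstract.

```python
from math import comb

# Candidate schemes as (name, k data chunks, n total chunks).
# 3-way replication is modeled as 1-of-3. The 30-of-33 entry is a
# hypothetical wide scheme added so the choice is not trivial.
SCHEMES = [
    ("3-way replication", 1, 3),
    ("6-of-9", 6, 9),
    ("10-of-14", 10, 14),
    ("30-of-33", 30, 33),
]


def storage_overhead(k, n):
    """Raw bytes stored per byte of user data for a k-of-n scheme."""
    return n / k


def stripe_loss_probability(k, n, p):
    """Probability that more than n-k of the n chunks are lost.

    Assumes (for illustration only) that each disk fails independently
    with probability p during one repair window.
    """
    tolerated = n - k
    return sum(
        comb(n, f) * p**f * (1 - p) ** (n - f)
        for f in range(tolerated + 1, n + 1)
    )


def pick_scheme(afr, target_loss, repair_days=1.0, schemes=SCHEMES):
    """Return the lowest-overhead scheme meeting target_loss, or None.

    afr is the disk group's annualized failure rate (e.g. 0.02 for 2%).
    """
    p = afr * repair_days / 365.0  # per-window failure probability
    feasible = [
        (storage_overhead(k, n), name)
        for name, k, n in schemes
        if stripe_loss_probability(k, n, p) <= target_loss
    ]
    return min(feasible)[1] if feasible else None


if __name__ == "__main__":
    for afr in (0.01, 0.10):
        print(afr, pick_scheme(afr, target_loss=1e-12))
```

Under these toy assumptions, a disk group with a 1% annualized failure rate can use the wide 30-of-33 scheme (1.1x overhead), whereas a group with a 10% rate falls back to 10-of-14 (1.4x overhead). This mirrors the abstract's point that a single static scheme is wasteful for strong disks or insufficient for weak ones.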

History

Date

2020-12-03

Degree Type

Dissertation

Department

Computer Science

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Gregory Ganger, Rashmi Vinayak

Keywords

reliability, durability, fault-tolerance, redundancy, distributed storage systems, cluster storage systems, disks, HDD, erasure code, replication, heterogeneity

Licence

In Copyright
