Carnegie Mellon University

An Analysis of Traces from a Production MapReduce Cluster (CMU-PDL-09-107)

journal contribution
posted on 2009-12-01, 00:00 authored by Soila Kavulya, Jiaqi Tan, Rajeev Gandhi, Priya Narasimhan
MapReduce is a programming paradigm for parallel processing that is increasingly being used for data-intensive applications in cloud computing environments. An understanding of the characteristics of workloads running in MapReduce environments benefits both cloud service providers and users: the service provider can use this knowledge to make better scheduling decisions, while the user can learn which aspects of their jobs impact performance. This paper analyzes 10 months of MapReduce logs from the M45 supercomputing cluster, which Yahoo! made freely available to select universities for systems research. We characterize resource utilization patterns, job patterns, and sources of failures. We use an instance-based learning technique that exploits temporal locality to predict job completion times from historical data and identify potential performance problems in our dataset.
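
To make the last sentence of the abstract concrete, the following is a minimal sketch of one possible instance-based predictor. It is not the authors' method: the abstract does not name the learner, so k-nearest neighbours is swapped in here as a simple stand-in. The class name `RecentJobPredictor`, the feature tuple (number of map tasks, number of reduce tasks), and the `window` and `k` parameters are all hypothetical, and the durations are made-up illustrative values, not data from the M45 traces. Temporal locality is modelled by keeping only the most recent jobs.

```python
from collections import deque
from math import sqrt


class RecentJobPredictor:
    """Hypothetical k-nearest-neighbour predictor over recently finished jobs."""

    def __init__(self, window=100, k=3):
        self.k = k
        # Temporal locality: only the most recent `window` jobs are retained;
        # older jobs are evicted automatically by the bounded deque.
        self.history = deque(maxlen=window)

    def observe(self, features, duration):
        """Record a finished job's feature tuple and its completion time."""
        self.history.append((tuple(features), float(duration)))

    def predict(self, features):
        """Return the mean duration of the k most similar recent jobs."""
        if not self.history:
            return None  # nothing to learn from yet

        def distance(past):
            # Euclidean distance between a stored job and the query job.
            return sqrt(sum((a - b) ** 2 for a, b in zip(past[0], features)))

        nearest = sorted(self.history, key=distance)[: self.k]
        return sum(duration for _, duration in nearest) / len(nearest)


# Illustrative values only: (map tasks, reduce tasks) -> seconds.
predictor = RecentJobPredictor(window=3, k=2)
predictor.observe((10, 2), 100.0)
predictor.observe((12, 2), 120.0)
predictor.observe((50, 8), 400.0)
predictor.observe((11, 2), 110.0)  # window is full, so the oldest job is dropped
estimate = predictor.predict((11, 2))
```

In this sketch a large gap between a job's predicted and actual completion time could be used as a hint of a performance problem; how the paper itself flags such jobs is not stated in the abstract.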

Publisher Statement

All Rights Reserved

Date

2009-12-01
