Incremental MapReduce Computations
Distributed processing of large data sets is an area that received much attention from researchers and practitioners over the last few years. In this context, several proposals exist that leverage the observation that data sets evolve over time, and as such there is often a substantial overlap between the input to consecutive runs of a data processing job. This allows the programmers of these systems to devise an efficient logic to update the output upon an input change. However, most of these systems lack compatibility existing models and require the programmer to implement an application-specific dynamic algorithm, which increases algorithm and code complexity.
In this chapter, we describe our previous work on building a platform called Incoop, which allows for running MapReduce computations incrementally and transparently. Incoop detects changes between two files that are used as inputs to consecutive MapReduce jobs, and efficiently propagates those changes until the new output is produced. The design of Incoop is based on memoizing the results of previously run tasks, and reusing these results whenever possible. Doing this efficiently introduces several technical challenges that are overcome with novel concepts, such as a large-scale storage system that efficiently computes deltas between two inputs, a Contraction phase to break up the work of the Reduce phase, and an affinitybased scheduling algorithm. This chapter presents the motivation and design of Incoop, as well as a complete evaluation using several application benchmarks. Our results show significant performance improvements without changing a single line of application code