file.pdf (270.18 kB)
Download fileEfficient Similarity Estimation for Systems Exploiting Data Redundancy
journal contribution
posted on 1998-07-24, 00:00 authored by Kanat Tangwongsan, Himabindu Pucha, David G. Andersen, Michael KaminskyMany modern systems exploit data redundancy to
improve efficiency. These systems split data into chunks, generate
identifiers for each of them, and compare the identifiers among
other data items to identify duplicate chunks. As a result, chunk
size becomes a critical parameter for the efficiency of these systems:
it trades potentially improved similarity detection (smaller
chunks) with increased overhead to represent more chunks.
Unfortunately, the similarity between files increases unpredictably
with smaller chunk sizes, even for data of the same type.
Existing systems often pick one chunk size that is “good enough”
for many cases because they lack efficient techniques to determine
the benefits at other chunk sizes. This paper addresses this
deficiency via two contributions: (1) we present multi-resolution
(MR) handprinting, an application-independent technique that
efficiently estimates similarity between data items at different
chunk sizes using a compact, multi-size representation of the
data; (2) we then evaluate the application of MR handprints to
workloads from peer-to-peer, file transfer, and storage systems,
demonstrating that the chunk size selection enabled by MR
handprints can lead to real improvements over using a fixed chunk
size in these systems.