Carnegie Mellon University
Browse

Consistent Weighted Sampling Made Fast, Small, and Easy

Download (620.06 kB)
journal contribution
posted on 1998-02-01, 00:00 authored by Bernhard Haeupler, Mark Manasse, Kunal Talwar

Document sketching using Jaccard similarity has been a workable effective technique in reducing near-duplicates in Web page and image search results, and has also proven useful in file system synchronization, compression and learning applications.
Min-wise sampling can be used to derive an unbiased estimator for Jaccard similarity and taking a few hundred independent consistent samples leads to compact sketches which provide good estimates of pairwise-similarity. Subsequent works extended this technique to weighted sets and show how to produce samples with only a constant number of hash evaluations for any element, independent of its weight. Another improvement by Li et al. shows how to speedup sketch computations by computing many (near-)independent samples in one shot. Unfortunately this latter improvement works only for the unweighted case.


In this paper we give a simple, fast and accurate procedure which reduces weighted sets to unweighted sets with small impact on the Jaccard similarity. This leads to compact sketches consisting of many (near-)independent weighted samples which can be computed with just a small constant number of hash function evaluations per weighted element. The size of the produced unweighted set is furthermore a tunable parameter which enables us to run the unweighted scheme of Li et al. in the regime where it is most efficient. Even when the sets involved are unweighted, our approach gives a simple solution to the densification problem that other works attempted to address.


Unlike previously known schemes, ours does not result in an unbiased estimator. However, we prove that the bias introduced by our reduction is negligible and that the standard deviation is comparable to the unweighted case. We also empirically evaluate our scheme and show that it gives significant gains in computational efficiency, without any measurable loss in accuracy.

History

Date

1998-02-01

Usage metrics

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC