posted on 1977-01-01, 00:00authored byChristopher R. Palmer, Christos Faloutsos
Data mining in large data sets often requires a sampling or
summarization step to form an in-core representation of the
data that can be processed more efficiently. Uniform random
sampling is frequently used in practice and also frequently
criticized because it will miss small clusters. Many natural
phenomena are known to follow Zipf 's distribution and the
inability of uniform sampling to find small clusters is of
practical concern. Density Biased Sampling is proposed
to probabilistically under-sample dense regions and oversample
light regions. A weighted sample is used to preserve
the densities of the original data. Density biased sampling
naturally includes uniform sampling as a special case. A
memory efficient algorithm is proposed that approximates
density biased sampling using only a single scan of the
data. We empirically evaluate density biased sampling
using synthetic data sets that exhibit varying cluster size
distributions finding up to a factor of six improvement over
uniform sampling.