posted on 1979-10-01, 00:00, authored by Spiros Papadimitriou, Hiroyuki Kitagawa, Phillip B Gibbons, Christos Faloutsos
Outlier detection is an integral part of data mining and has attracted much attention recently [BKNS00,
JTH01, KNT00]. In this paper, we propose a new method for evaluating outlier-ness, which we call
the Local Correlation Integral (LOCI). As with the best previous methods, LOCI is highly effective
for detecting outliers and groups of outliers (a.k.a. micro-clusters). In addition, it offers the following
advantages and novelties: (a) It provides an automatic, data-dictated cut-off to determine whether a
point is an outlier; in contrast, previous methods force users to pick cut-offs, without any hints as to
what cut-off value is best for a given dataset. (b) It can provide a LOCI plot for each point; this plot
summarizes a wealth of information about the data in the vicinity of the point, determining clusters,
micro-clusters, their diameters and their inter-cluster distances. None of the existing outlier-detection
methods can match this feature, because they output only a single number for each point: its outlier-ness
score. (c) Our LOCI method can be computed as quickly as the best previous methods. (d)
Moreover, LOCI leads to a practically linear approximate method, aLOCI (for approximate LOCI),
which provides fast, highly accurate outlier detection. To the best of our knowledge, this is the first
work to use approximate computations to speed up outlier detection.
Experiments on synthetic and real-world data sets show that LOCI and aLOCI can automatically detect
outliers and micro-clusters, without user-required cut-offs, and that they quickly spot both expected
and unexpected outliers.
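
The abstract does not spell out how the data-dictated cut-off in (a) is computed. As a rough illustration only, the sketch below assumes the multi-granularity deviation factor (MDEF) formulation from the full LOCI paper: a point is flagged when its MDEF exceeds k_sigma times the normalized deviation of neighbour counts in its sampling neighbourhood. The function name `loci_flags`, the defaults `alpha=0.5` and `k_sigma=3.0`, and the use of a single fixed radius `r` are assumptions made for this example. The full method sweeps over a range of radii and requires a minimum neighbourhood size, both of which are omitted here.

```python
import numpy as np

def loci_flags(points, r, alpha=0.5, k_sigma=3.0):
    """Flag points whose MDEF exceeds k_sigma * sigma_MDEF at one radius r.

    Hypothetical helper: a single-radius simplification, not the full LOCI sweep.
    """
    pts = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances (O(n^2) memory; acceptable for a small demo).
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    # n(p, alpha*r): neighbours within the counting radius, including p itself.
    counts = (dists <= alpha * r).sum(axis=1)
    # Sampling neighbourhood of p: every point within r of p, including p.
    sampling = dists <= r
    flags = np.zeros(len(pts), dtype=bool)
    for i in range(len(pts)):
        neigh_counts = counts[sampling[i]]
        n_hat = neigh_counts.mean()                # average count over the neighbourhood
        sigma_mdef = neigh_counts.std() / n_hat    # normalized deviation of those counts
        mdef = 1.0 - counts[i] / n_hat             # how sparse p is relative to neighbours
        flags[i] = mdef > k_sigma * sigma_mdef     # cut-off derived from the data itself
    return flags

# Usage: a tight 5x5 grid of points plus one isolated point.
grid = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
print(loci_flags(grid + [(3.0, 0.0)], r=4.0))
```

No outlier-score threshold is supplied by the caller in this sketch; the threshold for each point comes from the spread of neighbour counts around it, which is the sense in which the cut-off is data-dictated.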