Cross-Cloud Plots: Scalable Tools for Spatial and Multidimensional Data Mining
Journal contribution posted on 01.01.1978, 00:00 by Agma Traina, Caetano Traina, Christos Faloutsos, Spiros Papadimitriou
We focus on the problem of finding patterns across two large, multidimensional datasets. For example, given feature vectors of healthy and non-healthy patients, we want to answer the following questions: “Are the two clouds of points separable?”, “What is the smallest/largest pair-wise distance across the two datasets?”, “Which of the two clouds does a new point (feature vector) come from?”. We propose a new tool, the ‘Cross-Cloud plot’, which helps us answer the above questions, and many more. We present an algorithm to compute the Cross-Cloud plot that requires only a single pass over the datasets and thus scales up to arbitrarily large databases. More importantly, it scales linearly with the dimensionality, while the cost of most other spatial data mining algorithms grows exponentially with it. We show how to use our tool for classification in cases where traditional methods (nearest neighbor, classification trees) may fail. We also provide a set of rules on how to interpret a Cross-Cloud plot, and we apply these rules to multiple synthetic and real datasets.
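The abstract does not spell out the algorithm, so the following is only a rough sketch of how a single-pass, linear-in-dimensionality summary of two point clouds could look. It assumes a grid-based approach: each point is mapped to a grid cell at several cell sizes, and for each size we count the cross-cloud pairs that share a cell. The function name `cross_cloud_counts`, the parameter `cell_sides`, and the grid-cell counting itself are illustrative assumptions, not the authors' published method.

```python
from collections import Counter
import math


def cross_cloud_counts(cloud_a, cloud_b, cell_sides):
    """For each grid cell side r, count the cross-cloud pairs (one point
    from each cloud) that land in the same grid cell.

    Hypothetical helper for illustration only; the abstract does not
    describe the actual Cross-Cloud plot computation.
    """
    # One sparse cell-occupancy table per cloud and per cell side.
    occ_a = {r: Counter() for r in cell_sides}
    occ_b = {r: Counter() for r in cell_sides}

    # Each dataset is read exactly once. The work per point is
    # proportional to (number of dimensions) * (number of cell sides).
    for points, occ in ((cloud_a, occ_a), (cloud_b, occ_b)):
        for p in points:
            for r in cell_sides:
                cell = tuple(math.floor(x / r) for x in p)
                occ[r][cell] += 1

    # Pairs sharing a cell: sum over cells of count_a * count_b.
    # Looking up a missing key in a Counter returns 0.
    return {
        r: sum(n * occ_b[r][cell] for cell, n in occ_a[r].items())
        for r in cell_sides
    }


# Toy data: two tiny 2-D clouds (made up for illustration).
healthy = [(0.1, 0.1), (0.2, 0.3)]
sick = [(0.15, 0.2), (5.0, 5.0)]
counts = cross_cloud_counts(healthy, sick, [1.0, 8.0])
```

Plotting the logarithm of each count against the logarithm of the cell side would give one curve summarizing how the two clouds interleave at different scales. Only occupied cells are stored, so memory grows with the number of points rather than with the number of possible cells in high dimensions.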