We focus on the problem of finding patterns across two large, multidimensional datasets. For example,
given feature vectors of healthy and of non-healthy patients, we want to answer the following questions:
“Are the two clouds of points separable?”, “What is the smallest/largest pair-wise distance across the two
datasets?”, “Which of the two clouds does a new point (feature vector) come from?”.
We propose a new tool, the ‘Cross-Cloud plot’, which helps us answer the above questions, and many
more. We present an algorithm to compute the Cross-Cloud plot, which requires only a single pass over
the datasets, thus scaling up to arbitrarily large databases. More importantly, its cost grows linearly with the
dimensionality, whereas the cost of most other spatial data mining algorithms grows exponentially. We show how to
use our tool for classification in cases where traditional methods (nearest neighbor, classification trees) may fail. We
also provide a set of rules for interpreting a Cross-Cloud plot, and we apply these rules to multiple
synthetic and real datasets.
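
To make the pairwise-distance question above concrete, the following is a minimal brute-force sketch. It is not the single-pass algorithm proposed here: it compares every cross-dataset pair, so its cost is quadratic in the number of points. The function name `cross_distance_extremes` and the toy point sets are hypothetical, chosen only for illustration, and Euclidean distance is assumed.

```python
import math
from itertools import product


def cross_distance_extremes(cloud_a, cloud_b):
    """Return (smallest, largest) Euclidean distance over all pairs (a, b)
    with a taken from cloud_a and b taken from cloud_b.

    Brute-force baseline: examines len(cloud_a) * len(cloud_b) pairs.
    """
    dists = [math.dist(a, b) for a, b in product(cloud_a, cloud_b)]
    return min(dists), max(dists)


# Hypothetical toy data: two small 2-D clouds of feature vectors.
healthy = [(0.0, 0.0), (1.0, 0.0)]
non_healthy = [(4.0, 0.0), (4.0, 3.0)]

smallest, largest = cross_distance_extremes(healthy, non_healthy)
print(smallest, largest)
```

Such a baseline is only practical for small datasets, which is the limitation that motivates a single-pass method.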