posted on 1999-01-01, 00:00authored byDan Pelleg, Andrew Moore
We present new algorithms for the k-means clustering problem. They use the kd-tree data structure to reduce the large number of
nearest-neighbor queries issued by the traditional algorithm. Sufficient statistics are stored in the nodes of the kd-tree. Then, an analysis of
the geometry of the current cluster centers results in great reduction of the work needed to update the centers. Our algorithms behave exactly as the traditional k-means algorithm. Proofs of correctness
are included. The kd-tree can also be used to initialize the k-means
starting centers efficiently. Our algorithms can be easily extended to
provide fast ways of computing the error of a given cluster assignment,
regardless of the method in which those clusters were obtained. We
also show how to use them in a setting which allows approximate
clustering results, with the benefit of running faster.
We have implemented and tested our algorithms on both real and
simulated data. Results show a speedup factor of up to 170 on real
astrophysical data, and superiority over the naive algorithm on simulated data in up to 5 dimensions. Our algorithms scale well with
respect to the number of points and number of centers, allowing for
clustering with tens of thousands of centers.