Foundations of Clustering: New Models and Algorithms
In this dissertation, we study clustering, one of the most common unsupervised learning problems. The dissertation covers recent developments in both clustering theory and machine learning practice. In particular, it explores how to bridge the gap between theory and practice by making them benefit from each other. Many clustering algorithms are developed in an application-oriented way, without the guidance of theory. For some clustering problems it is hard to mathematically characterize what is being optimized. Emerging needs in the ML/AI community, such as fairness and scalability, also require updates to current problem formulations.

The first few chapters of this dissertation lay the theoretical foundation for multiple clustering problems. We first establish a formal optimization framework. Such a framework gives us a conceptual understanding of the problems and becomes the basis for optimization and algorithm design. We then discuss the performance of existing approaches and propose new algorithms that beat the state of the art. Empirical evaluations verify that the new algorithms perform better in both quality and efficiency, showing that it is beneficial to view these problems through the lens of theory.

We study one classic clustering problem: hierarchical clustering. Unlike more well-formulated clustering problems such as k-means, the theoretical study of hierarchical clustering has begun only recently. The first chapter focuses on the design of a new objective function for hierarchical clustering on point inputs in Euclidean space, and provides theoretical guarantees for a popular heuristic. The second chapter studies how to incorporate fairness into the hierarchical clustering optimization framework. It defines fair hierarchical clustering trees and discusses how to design algorithms that find fair solutions under the hierarchical clustering objectives previously established by the community. Surprisingly, in this setting fairness can be imposed at minimal loss in clustering performance.

The focus then shifts to speeding up well-known clustering algorithms in scenarios where they are inefficient. We consider average-linkage, one of the most commonly used hierarchical clustering algorithms, which builds the hierarchical tree from bottom to top.
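As a point of reference, the following is a minimal, illustrative sketch (not the dissertation's code) of the standard bottom-up average-linkage procedure, assuming numpy is available; every round scans all pairs of current clusters for the smallest average pairwise distance, which is exactly the step the later chapters aim to accelerate. All names in the sketch are illustrative.

```python
# Illustrative sketch: naive bottom-up average-linkage clustering.
# Each round scans all cluster pairs for the smallest average pairwise
# distance -- the bookkeeping that makes the method slow at scale.
import numpy as np

def average_linkage(points):
    """points: (n, d) array. Returns the list of merges forming the tree."""
    # Precompute all pairwise point distances.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    clusters = [[i] for i in range(len(points))]   # start from singletons
    merges = []
    while len(clusters) > 1:
        best = (None, None, float("inf"))
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between clusters a and b.
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]     # merge b into a
        del clusters[b]
    return merges

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for left, right in average_linkage(rng.normal(size=(8, 2))):
        print(left, "+", right)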
Average-linkage is known to scale poorly to large datasets, as it requires iteratively searching for the two clusters with the smallest average pairwise distance in the current clustering, which is time-consuming. To speed up this cluster search, we introduce a new technique called "cluster embedding". It maps clusters into points in Euclidean space; the points are then used as surrogates for the clusters, enabling us to apply Approximate Nearest Neighbor (ANN) techniques. We reduce the previously quadratic running-time bound to only slightly super-linear.

New challenges can also arise from data input formats other than the conventional sample-feature matrix. Consider relational databases, one of the most common and highly compact data storage formats. The naive way of running conventional ML algorithms requires converting the given database into matrix format, which can cause the input size to grow exponentially. Instead, we design algorithms that work directly on relational databases without recovering the sample-feature matrix. We give such algorithms for the classical k-means problem, showing how to adapt the famous k-means++ algorithm to find a constant-factor approximation to the optimal k-means solution.

Conversely, this dissertation shows how we can rethink the design of combinatorial algorithms by augmenting them with learned oracles in a data-driven way. Traditional algorithm design and analysis is often bottlenecked by worst-case instances. Meanwhile, practitioners often have historical records of past data and solutions. Training ML oracles on such data can give us knowledge about the current problem instance, and this knowledge can help the algorithm get past the "hurdles" posed by hard instances. We call such knowledge "predictions". The remaining chapters propose feasible predictions in the context of different clustering problems and discuss how to design better algorithms that utilize these predictions.

We revisit the scalable hierarchical clustering algorithm designs explored in previous chapters and extend them to inputs in more general metric spaces. In Euclidean spaces we design cluster embeddings and couple them with ANN search to efficiently identify clusters to merge; however, ANN techniques are not known to exist for general metrics. We show how a proxy metric, which approximates the original metric, can be used to support ANN search with minimal loss in hierarchical clustering performance.

Finally, we consider correlation clustering. Given a set of points along with recommendations on whether each pair of points should be placed in the same cluster or in separate clusters, the goal is to cluster the points so as to minimize disagreements with the recommendations. We study this problem in the online setting, where points arrive one at a time, and upon arrival the algorithm must make an irrevocable cluster-assignment decision. A simple lower bound rules out any algorithm with a non-trivial competitive ratio. We propose using a small, randomized reference subset of nodes to help make online clustering decisions: upon the arrival of a new node, the algorithm can check whether it is recommended to be in the same or a different cluster as each node in the reference set. We prove that the famous Pivot algorithm performs well in this setting. Moreover, its performance is robust to adversarial perturbations of the reference set.
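For context, below is a minimal, illustrative sketch (not the dissertation's code) of the classic offline Pivot algorithm for correlation clustering, which the final chapter adapts to the online setting via the random reference set described above; the graph encoding and function names are assumptions made for the example.

```python
# Illustrative sketch: the classic Pivot algorithm for correlation clustering.
# "+" recommendations are encoded as an adjacency set per node; the online
# variant discussed above instead compares each arriving node against a small
# random reference set.
import random

def pivot(nodes, plus_neighbors, seed=0):
    """Cluster `nodes` given plus_neighbors[v] = set of nodes recommended
    to share a cluster with v. Returns a list of clusters."""
    rng = random.Random(seed)
    order = list(nodes)
    rng.shuffle(order)                  # random pivot order
    unclustered = set(nodes)
    clusters = []
    for v in order:
        if v not in unclustered:
            continue
        # v becomes a pivot and absorbs its still-unclustered "+" neighbors.
        cluster = {v} | (plus_neighbors[v] & unclustered)
        unclustered -= cluster
        clusters.append(sorted(cluster))
    return clusters

if __name__ == "__main__":
    plus = {1: {2, 3}, 2: {1}, 3: {1}, 4: set()}
    print(pivot(plus.keys(), plus))
```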
History
Date
- 2022-05-12
Degree Type
- Dissertation
Department
- Tepper School of Business
Degree Name
- Doctor of Philosophy (PhD)