Learning with Clusters
Clustering, the problem of grouping similar data, has been studied extensively since at least the 1950s. As machine learning becomes more prominent, clustering has evolved from primarily a data analysis tool into an integrated component of complex robotic and machine learning systems, including those involving dimensionality reduction, anomaly detection, network analysis, image segmentation, and classification of groups of data. With this integration into multi-stage systems comes a need to better understand the interactions between pipeline components. Changing the parameters of the clustering algorithm affects downstream components and, unfortunately, it is usually not possible to simply backpropagate through the entire system. Instead, it is common practice to take the output of the clustering algorithm as ground truth at the next module of the pipeline. We show that this false assumption causes subtle and dangerous behavior in even the simplest systems, empirically biasing results by upwards of 25%.
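As a concrete illustration of this failure mode (a minimal sketch, not an example from the dissertation), consider estimating per-group means when a hypothetical clustering step mislabels 10% of the points. Treating the cluster labels as ground truth biases every downstream estimate, and collecting more data does not help:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Two true groups with means 0.0 and 1.0; the downstream task is to
# estimate each group's mean from the data.
true_labels = rng.integers(0, 2, size=n)
x = true_labels.astype(float) + rng.normal(0.0, 0.1, size=n)

# A hypothetical noisy clustering: 10% of points are assigned to the
# wrong cluster (standing in for any imperfect clustering algorithm).
flip = rng.random(n) < 0.10
cluster_labels = np.where(flip, 1 - true_labels, true_labels)

# Naive downstream learner: treat cluster labels as ground truth.
for k in (0, 1):
    naive = x[cluster_labels == k].mean()
    truth = x[true_labels == k].mean()
    print(f"group {k}: true mean {truth:.3f}, naive estimate {naive:.3f}")
# Each naive estimate is pulled ~0.1 toward the other group's mean:
# a persistent 10% bias that no amount of additional data removes.
```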
We address this gap by developing scalable estimators and methods to both quantify and compensate for the impact of clustering errors on downstream learners. Our work is agnostic to the choice of the other components of the machine learning system and requires few assumptions on the clustering algorithm. Theoretical and empirical results demonstrate that our methods and estimators are superior to the current naive approaches, which do not account for clustering errors. We also develop several new clustering algorithms and prove theoretical bounds for existing algorithms, to be used as inputs to our error-correction methods. Not surprisingly, we find that learning on clusters of data is both theoretically and empirically easier as the number of clustering errors decreases. Thus, our work is twofold: we attempt to provide the best clustering possible, and we establish how to learn effectively on inevitably noisy clusters.
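To make "compensating for clustering errors" concrete, here is one generic correction, sketched under the strong assumption that the mislabeling rate p is known and the groups are equal-sized (the dissertation's actual estimators are more general than this). Each naive per-cluster mean is then a known convex combination of the true group means, so the true means can be recovered by inverting a small linear system:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Same setup as the sketch above: two equal-sized groups, 10% of the
# points assigned to the wrong cluster by a hypothetical clusterer.
true_labels = rng.integers(0, 2, size=n)
x = true_labels.astype(float) + rng.normal(0.0, 0.1, size=n)
flip = rng.random(n) < 0.10
cluster_labels = np.where(flip, 1 - true_labels, true_labels)

# Each naive cluster mean mixes the true group means:
#   m_k = (1 - p) * mu_k + p * mu_{1-k}
# so with a known (here, assumed) error rate p we can invert the mixing.
p = 0.10
m = np.array([x[cluster_labels == k].mean() for k in (0, 1)])
A = np.array([[1 - p, p],
              [p, 1 - p]])
mu_hat = np.linalg.solve(A, m)  # corrected estimates, close to (0, 1)
print(mu_hat)
```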
History
Date
- 2019-01-01
Degree Type
- Dissertation
Department
- Robotics Institute
Degree Name
- Doctor of Philosophy (PhD)