Carnegie Mellon University
Browse
mbarnes1_robotics_2019.pdf (2.67 MB)

Learning with Clusters

Download (2.67 MB)
thesis
posted on 2019-01-01, 00:00 authored by Matthew BarnesMatthew Barnes
Clustering, the problem of grouping similar data, has been extensively studied since at least the 1950's. As machine learning becomes more prominent, clustering has evolved from primarily a data analysis tool into an integrated component of complex robotic and machine learning
systems, including those involving dimensionality reduction, anomaly detection, network analysis, image segmentation and classifying groups of data. With this integration into multi-stage systems comes a need to better
understand interactions between pipeline components. Changing parameters of the clustering algorithm will impact downstream components and, quite unfortunately, it is usually not possible to simply backpropagate
through the entire system. Instead, it is common practice to take the output of the clustering algorithm as ground truth at the next module of the pipeline. We show this false assumption causes subtle and dangerous
behavior for even the simplest systems { empirically biasing results by upwards of 25%.
We address this gap by developing scalable estimators and methods to both quantify and compensate the impact of clustering errors on downstream learners. Our work is agnostic to the choice of other components of the
machine learning systems, and requires few assumptions on the clustering algorithm. Theoretical and empirical results demonstrate our methods and estimators are superior to the current naive approaches, which do not
account for clustering errors. We also develop several new clustering algorithms and prove theoretical
bounds for existing algorithms, to be used as inputs to our error-correction methods. Not surprisingly, we fi nd that learning on clusters of data is both theoretically and empirically easier as the number of clustering
errors decreases. Thus, our work is two-fold. We attempt to provide the best clustering possible as well as establish how to effectively learn on inevitably noisy clusters.

History

Date

2019-01-01

Degree Type

  • Dissertation

Department

  • Robotics Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Artur Dubrawski

Usage metrics

    Keywords

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC