Carnegie Mellon University

Understanding Training Data in Large-Scale Machine Learning

thesis
posted on 2024-10-23, 18:31 authored by Sang Keun Choe

As the capabilities of large-scale machine learning (ML) systems rapidly improve, the reliable development and deployment of these systems is attracting increasing attention. Based on the premise that ML models are in large part reflections of their training data, this thesis aims to achieve reliable ML by developing principled, scalable, and operationalizable frameworks for understanding the influence of each piece of training data on the final ML model. In the development phase, such frameworks support a wide range of data curation tasks, including noisy-label detection, data pruning, and data reweighting. In the deployment phase, they enable data attribution and valuation, addressing newly emerging societal challenges such as data author compensation and data copyright infringement detection.

Towards this goal, we propose extending the inductive programming framework, which likens training data (or, more generally, inductive biases) in ML to source code in traditional software, noting that source code plays a pivotal role in understanding software. In particular, we design a unit data structure specific to inductive biases in ML and establish a mathematical structure by projecting each unit onto a gradient space with a local metric defined by the Fisher information matrix (FIM). We show that mathematical operations on this space, such as the inner product and norm, can estimate the influence of each inductive bias on the final behavior of an ML model and the uncertainty in its predictions, thereby laying a foundation for controlling and interpreting black-box ML models, much as source code does in traditional software. The main challenges in operationalizing this framework are scalability, algorithmic instability, and the programming interface, owing to the inherently high-dimensional and stochastic nature of the gradient space. In this thesis, we focus on two tasks, automated data optimization (for controllability) and data influence analysis (for interpretability), and address these challenges by co-designing ML algorithms, systems, and software. Our systems for both tasks are open-sourced to facilitate research in this direction.
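To make the gradient-space idea concrete, the following is a minimal, hypothetical sketch (not the thesis's actual system or API) of FIM-preconditioned influence scoring for a simple logistic-regression model: each training example is mapped to its per-example gradient, and its influence on a test example's loss is approximated by an inner product under a damped empirical-Fisher metric. All function and variable names here are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_loss(w, x, y):
    """Gradient of the log-loss for a single example (x, y) at parameters w."""
    return (sigmoid(x @ w) - y) * x

def influence_scores(w, X_train, y_train, x_test, y_test, damping=1e-2):
    """Approximate influence of each training point on the test loss:
    score_i = g_i^T (F + damping*I)^{-1} g_test, where F is the empirical
    Fisher information matrix built from per-example training gradients."""
    grads = np.stack([grad_loss(w, x, y) for x, y in zip(X_train, y_train)])
    F = grads.T @ grads / len(grads)                      # empirical FIM
    g_test = grad_loss(w, x_test, y_test)
    precond = np.linalg.solve(F + damping * np.eye(len(w)), g_test)
    return grads @ precond                                # one score per example

# Toy usage: 8 random training points in 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = (X[:, 0] > 0).astype(float)
w = np.zeros(3)
scores = influence_scores(w, X, y, X[0], y[0])
print(scores.shape)  # (8,)
```

Because the damped Fisher matrix is positive definite, an example's self-influence (querying with its own gradient) is always positive, which is a common sanity check for such estimators. Scaling this beyond toy models is exactly where the scalability and stability challenges discussed above arise.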

History

Date

2024-09-06

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Eric P. Xing