Carnegie Mellon University

Democratizing Foundation Models Through Robust Understanding and Learning with Imperfect Data

Thesis
Posted on 2025-05-14, 18:37, authored by Hao Chen

The field of generative AI has witnessed unprecedented growth, driven by advancements in large foundation models. However, this progress has created a critical bottleneck: the development of these models has become increasingly expensive and exclusive due to their reliance on meticulously curated large-scale datasets. Large tech companies invest enormous resources to collect and clean massive datasets for pretraining and adaptation. This data-centric barrier not only widens the gap between resource-intensive corporate research and academic efforts but also deepens the black-box nature of foundation models. Even for large companies, this approach to data curation ultimately exhausts the available high-quality data and eventually fails to scale. Meanwhile, vast amounts of imperfect data – containing noise, weak signals, and biases – remain readily available and inexpensive, but training models on such data has traditionally yielded inferior performance compared to models trained on meticulously curated datasets. As AI increasingly shapes our world and continues to expand in scale, we face a fundamental challenge: How can we transform this abundance of imperfect data from a limitation into an opportunity to democratize AI development? This data-centric democratization would not only make AI development more accessible, but would also lead to more robust and adaptable foundation models that better reflect the complexity and diversity of real-world data.

My thesis addresses this challenge by pioneering Imperfection-Aware AI – a paradigm shift that enables AI systems to work effectively with inexpensive and imperfect data resources. Training foundation models with diverse and real-world imperfect data naturally exposes them to the complexity and nuance of human-generated content, making them better reflect and handle real-world variations. By transforming the traditionally perceived weaknesses of training with imperfect data into strengths, we can foster more robust, ethical, and universally adaptable AI systems that are accessible to researchers and developers worldwide. In pursuit of this vision, my research agenda focuses on data-centric methods to understand the physics of foundation models trained on imperfect data, mitigate potential adverse effects from training with imperfection, and leverage various imperfect data and labels for more robust learning.

• Exploring Effects of Pre-Training Data Imperfections: I investigate how different types of data imperfections (e.g., corruption, bias, diversity) influence the physics of foundation models during pre-training. My work was among the first to reveal that these models require data imperfections in pre-training to generalize better. This finding fundamentally transforms how we view imperfect data – from an obstacle to be eliminated into a valuable resource that can democratize AI development while improving model robustness.

• Understanding and Mitigating Catastrophic Inheritance: While leveraging imperfect data can facilitate AI democratization, we must also understand its limitations. I introduced the concept of Catastrophic Inheritance, a new research direction that examines how imperfections in pre-training data propagate to and affect downstream tasks. I build open-source evaluation tools and develop fine-tuning methods to mitigate these detrimental effects, ensuring that models trained on imperfect data maintain reliability and generalization capabilities.

• Leveraging Imperfect Data and Labels for Transfer Learning: To make AI development truly accessible, I develop robust learning methods that effectively utilize readily available imperfect data and labels to enable efficient model adaptation on downstream tasks. Notably, my work proposed the first general framework capable of universally handling more than 14 types of weak and noisy supervision, making scalable transfer learning possible in practical scenarios where only mixed imperfect data are available.

This thesis endeavors to provide insights into data imperfections in the era of large foundation models, to bring techniques for learning with imperfect data into practice, and to inspire further research in related fields.

History

Date

2025-03-12

Degree Type

  • Dissertation

Thesis Department

  • Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Bhiksha Raj
