Carnegie Mellon University
svmehta_phd_lti_2023.pdf (3.57 MB)

Efficient Lifelong Learning in Deep Neural Networks: Optimizing Architecture, Training, and Data

posted on 2024-02-23, 14:53, authored by Sanket Vaibhav Mehta

The prevalent machine learning paradigm involves training a separate model for every new task given a static dataset. In contrast, humans accumulate knowledge over time, and the lifelong learning paradigm seeks to emulate this process by enabling systems to learn continuously from a stream of tasks, retaining past knowledge for efficient future learning. This paradigm also offers advantages such as avoiding periodic model retraining, potentially reducing computational and energy requirements, and promoting environmentally friendly Green AI. In modern machine learning, deep neural networks, while powerful, face challenges like catastrophic forgetting (losing knowledge from previous tasks while learning new ones) and negative interference (previously learned knowledge hindering new task learning). These issues arise from the stability-plasticity dilemma, which necessitates finding the right balance between preserving past knowledge (stability) and acquiring new knowledge (plasticity). Efficient lifelong learning systems must address this dilemma, along with other considerations such as supporting online data streams, operating within a small, fixed memory buffer capacity (if any), and learning from unlabeled data streams.

In this thesis, we draw inspiration from the biological learning process and recent progress in deep learning to enable efficient lifelong learning systems. We propose injecting inductive biases into the three main components of data-driven machine learning: model (architecture & initialization), training (objective & optimization), and data. This thesis is structured into three parts, each corresponding to one of these components. In the first part, we explore the role of pre-trained initializations, revealing that they implicitly alleviate forgetting compared to random ones. Next, we design a parameter-efficient expert architecture that dynamically expands learning capacity to address the stability-plasticity dilemma. In the second part, we demonstrate that explicit optimization for flat minima improves network stability, and we introduce a meta-learning objective for balancing stability and plasticity. The third part delves into lifelong semi-supervised learning, addressing the stability-plasticity dilemma by rehearsing pseudo-labeled data. We conclude by examining pre-training from the perspective of lifelong learning, showcasing improvements obtained by applying the above strategies to the (continual) pre-training of models.
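To make the rehearsal idea concrete, the sketch below shows a fixed-capacity replay buffer maintained with reservoir sampling, one standard way such buffers are built. All names and the API here are illustrative assumptions for exposition, not the thesis's actual implementation; in lifelong semi-supervised learning, the stored labels would be pseudo-labels produced by the model itself.

```python
import random


class RehearsalBuffer:
    """Hypothetical fixed-capacity replay buffer (illustrative sketch).

    Stores (example, pseudo_label) pairs from past tasks via reservoir
    sampling, so a uniform sample of the stream fits in fixed memory.
    Replaying these pairs alongside new-task data is one way to
    mitigate catastrophic forgetting.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0  # total stream items observed so far
        self.rng = random.Random(seed)

    def add(self, example, pseudo_label):
        """Reservoir sampling: item i survives with probability capacity/i."""
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append((example, pseudo_label))
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = (example, pseudo_label)

    def sample(self, k):
        """Draw up to k stored pairs to mix into the current training batch."""
        k = min(k, len(self.buffer))
        return self.rng.sample(self.buffer, k)
```

During training on a new task, each incoming batch would be augmented with `buffer.sample(k)` before the gradient step, so the loss covers both current and replayed past data.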




Degree Type

  • Dissertation


Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)


Advisor(s)

  • Emma Strubell