Carnegie Mellon University

Continual Learning on Speech and Audio: Towards Data, Model and Metrics

Thesis, posted on 2024-06-28, 15:37; authored by Muqiao Yang

In recent years, the community has witnessed the enormous progress of deep neural network models in matching or even surpassing human performance on a variety of speech and audio tasks, including Automatic Speech Recognition (ASR), Spoken Language Understanding (SLU), Text-to-Speech (TTS), etc. However, their impressive and powerful achievement is predominantly dependent on training with a large set of data defined by a particular and rigid task. In such a paradigm, the model is expected to learn universal knowledge from a static body of data and stationary environments. In contrast, the real world is inherently ever-changing and non-stationary. New data is often generated and collected every second in a stream format, and novel classes may also emerge from time to time. Without proper adaptation techniques, the knowledge learned in the past can be easily erased while the model learns subsequent tasks, resulting in overall performance degradation. This phenomenon is called catastrophic forgetting, and it limits the practical use and expansion of many deep neural network models.
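
The forgetting effect described above can be illustrated with a deliberately tiny toy model (a hypothetical one-parameter regressor, not taken from the thesis): after fitting a single weight to a "task A" target and then to a "task B" target by plain gradient descent, the task-A error returns to a large value, because nothing preserves the earlier solution.

```python
def fit(w, target, lr=0.1, steps=200):
    """Plain gradient descent on the squared error (w - target)**2."""
    for _ in range(steps):
        w -= lr * 2.0 * (w - target)
    return w

def error(w, target):
    return (w - target) ** 2

w = 0.0
w = fit(w, target=2.0)         # train on "task A"
err_a_before = error(w, 2.0)   # near zero: task A is learned
w = fit(w, target=-2.0)        # then train only on "task B"
err_a_after = error(w, 2.0)    # large again: task A is forgotten
```

Real networks have many parameters and the tasks overlap partially, but the mechanism is the same: optimizing only the current objective overwrites weights that earlier tasks relied on.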

Continual learning has emerged as a new machine learning paradigm that enables artificial intelligence (AI) systems to learn from a continuous stream of data and incrementally improve their performance over time. By adapting to changing environments and user needs, continual learning aims to address the catastrophic forgetting effect, so that the model can gradually extend the knowledge it acquires without drastically forgetting the knowledge that has been learned in the past. Such a property is crucial in practical applications to enable artificial systems to learn from the infinite streams of data of the changing world in a lifelong manner.
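
One widely used family of continual learning methods is rehearsal: keeping a small memory of past examples and mixing them into later training. A minimal sketch of such a memory, using reservoir sampling so every stream example has an equal chance of being retained, is below; the class name and API are illustrative, not the thesis's implementation.

```python
import random

class ReservoirBuffer:
    """Fixed-size replay memory filled by reservoir sampling over a stream."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.data = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        """Observe one stream example; keep it with probability capacity/n_seen."""
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            idx = self.rng.randrange(self.n_seen)
            if idx < self.capacity:
                self.data[idx] = example

    def sample(self, k):
        """Draw a rehearsal mini-batch from the memory."""
        return self.rng.sample(self.data, min(k, len(self.data)))

buf = ReservoirBuffer(capacity=50)
for x in range(1000):   # a stream of 1000 examples
    buf.add(x)
batch = buf.sample(10)  # examples to replay alongside new-task data
```

During training on a new task, each gradient step would combine a batch of new data with a batch from `buf`, so the loss still covers earlier tasks.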

This thesis focuses on the underexplored area of how continual learning techniques can be effective in speech and audio tasks, via three perspectives: data, model, and metrics. We will introduce the background and formulations of multiple continual learning scenarios, including data-incremental, class-incremental, and task-incremental settings. Then we will present how different categories of continual learning scenarios and methods can be applied to different modules of the modeling pipeline. Starting from the taxonomy of methods, we propose to improve continual learning along these three perspectives. First, we demonstrate how to address data sampling, selection, and imbalance to help with continual learning on different audio tasks. Second, we show how the joint use of model architecture and data with different learning strategies can benefit continual learning processes. Lastly, we propose new continual evaluation metrics to give us a comprehensive and deeper understanding of general continual learning behaviors. We believe that this thesis provides an overall exploration of continual learning scenarios in various speech and audio tasks, and makes an important step towards realizing lifelong learning of speech interfaces.
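
For context on continual evaluation, two conventional quantities from the continual learning literature (average accuracy and backward transfer; these are the standard definitions, not necessarily the new metrics the thesis proposes) can be computed from a matrix R where R[i][j] is the accuracy on task j after training on task i:

```python
def average_accuracy(R):
    """Mean accuracy over all tasks after training on the final task."""
    last = R[-1]
    return sum(last) / len(last)

def backward_transfer(R):
    """Average change in accuracy on earlier tasks once the last task is
    trained; negative values indicate forgetting."""
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)

# Hypothetical accuracy matrix for three sequential tasks
R = [
    [0.90, 0.00, 0.00],
    [0.60, 0.85, 0.00],
    [0.50, 0.70, 0.80],
]
acc = average_accuracy(R)   # ~0.667
bwt = backward_transfer(R)  # -0.275, i.e. forgetting occurred
```

Metrics of this form only summarize the end state; continual evaluation, as studied in the thesis, additionally tracks how performance evolves throughout training.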

History

Date

2024-06-10

Degree Type

  • Dissertation

Department

  • Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Bhiksha Ramakrishnan
