Continual Learning on Speech and Audio: Towards Data, Model and Metrics
In recent years, the community has witnessed the enormous progress of deep neural network models in matching or even surpassing human performance on a variety of speech and audio tasks, including Automatic Speech Recognition (ASR), Spoken Language Understanding (SLU), and Text-to-Speech (TTS). However, these impressive achievements depend predominantly on training with a large dataset defined by a particular, rigid task. In such a paradigm, the model is expected to learn universal knowledge from a static body of data and a stationary environment. In contrast, the real world is inherently ever-changing and non-stationary: new data is generated and collected every second in a streaming fashion, and novel classes may emerge over time. Without proper adaptation techniques, knowledge learned in the past can easily be erased while the model learns subsequent tasks, resulting in overall performance degradation. This phenomenon, called catastrophic forgetting, limits the practical use and expansion of many deep neural network models.
Continual learning has emerged as a new machine learning paradigm that enables artificial intelligence (AI) systems to learn from a continuous stream of data and incrementally improve their performance over time. By adapting to changing environments and user needs, continual learning aims to mitigate catastrophic forgetting, so that the model can gradually extend its knowledge without drastically forgetting what it has learned in the past. This property is crucial in practical applications, enabling artificial systems to learn from the infinite data streams of a changing world in a lifelong manner.
This thesis mainly focuses on the underexplored question of how continual learning techniques can be effective for speech and audio tasks, viewed from three perspectives: data, model, and metrics. We first introduce the background and formulations of multiple continual learning scenarios, including data-incremental, class-incremental, and task-incremental settings. We then present how different categories of continual learning scenarios and methods can be applied to different modules of the modeling pipeline. Building on this taxonomy of methods, we propose improvements to continual learning along the three perspectives. First, we demonstrate how to address data sampling, selection, and imbalance to support continual learning on different audio tasks. Second, we show how the joint use of model architecture and data with different learning strategies can benefit continual learning processes. Lastly, we propose new continual evaluation metrics that give a comprehensive and deeper understanding of general continual learning behaviors. We believe that this thesis provides an overall exploration of continual learning scenarios in various speech and audio tasks, and makes an important step towards realizing lifelong learning for speech interfaces.
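To make the class-incremental setting mentioned above concrete, the following is a minimal sketch (not taken from the thesis; the dataset, labels, and task sizes are hypothetical) of how an audio classification dataset might be partitioned into a sequence of tasks, each introducing a disjoint set of new classes:

```python
# Minimal sketch of a class-incremental split for an audio
# classification dataset. The sample labels and task size below
# are hypothetical, purely for illustration.

def make_class_incremental_tasks(samples, classes_per_task):
    """Group (audio_id, label) pairs into a sequence of tasks,
    each task introducing a disjoint set of new classes."""
    labels = sorted({label for _, label in samples})
    tasks = []
    for i in range(0, len(labels), classes_per_task):
        task_labels = set(labels[i:i + classes_per_task])
        tasks.append([s for s in samples if s[1] in task_labels])
    return tasks

# Toy stand-in data: (audio_id, class_label)
samples = [("a", "dog_bark"), ("b", "siren"),
           ("c", "rain"), ("d", "speech")]
tasks = make_class_incremental_tasks(samples, classes_per_task=2)
# A continually learning model would then train on tasks[0],
# tasks[1], ... in sequence, without revisiting earlier tasks'
# data -- the regime in which catastrophic forgetting arises.
```

In the data-incremental setting, by contrast, the label set stays fixed and only new samples arrive over time; in the task-incremental setting, a task identifier is additionally available at test time.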
History
Date
- 2024-06-10
Degree Type
- Dissertation
Department
- Electrical and Computer Engineering
Degree Name
- Doctor of Philosophy (PhD)