Computational Audition with Imprecise Labels
Sounds are essential to our physical environment and play a critical role in how we interact with it. Throughout our lives, we develop the ability to interpret and understand the myriad sounds around us, allowing us to navigate and function seamlessly within our environments. The goal of computational audio processing, or computational audition, is to emulate this ability of the human brain (and even that of animals). At the broadest level, this thesis explores the challenge of teaching machines to interpret and understand the acoustic landscape.
For machines to accurately interpret the acoustic environment, they must be able to distinguish all types of sounds. However, the range of possible sounds in the world is vast, and their complete variety is unknown. Current audio understanding and interpretation approaches are constrained to recognizing a limited subset of “known” sounds (or sound events) in digital audio recordings. Even within this subset of known sounds, current approaches to detecting and interpreting them require large labeled datasets to achieve good performance. In real-world scenarios, the scarcity of labeled data often hampers the effectiveness of supervised learning algorithms, limiting their scalability and applicability.
A central contribution of this thesis is the development of methods for detecting known sound events without the need for extensive and accurately labeled data. Models trained without accurately labeled data often perform suboptimally, regardless of the size of the dataset. This work addresses the problem of formulating effective strategies for modeling sound with imprecisely labeled training data; in other words, we address the problem of building accurate models from weakly labeled data. The term weak labels refers to labels that indicate only the presence or absence of an event, without specifying its exact temporal boundaries.
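To make the weak-label setting concrete, the sketch below shows one common way such labels are used in practice: frame-level predictions are pooled into a single clip-level score that is compared against the presence/absence label, so no temporal boundaries are needed during training. This is a minimal, illustrative multiple-instance-learning formulation in PyTorch, not necessarily the architecture used in this thesis; the class name, feature dimensions, and the max-pooling choice are assumptions made for the example.

```python
import torch
import torch.nn as nn

class WeakLabelPooler(nn.Module):
    """Illustrative clip-level classifier trained from weak (presence/absence) labels.

    A frame-level scorer produces per-frame event logits; max-pooling them over
    time yields a clip-level score that can be matched against the weak label.
    Temporal boundaries are never required during training.
    """
    def __init__(self, n_features: int, n_events: int):
        super().__init__()
        self.frame_scorer = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(), nn.Linear(128, n_events)
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, n_features) -> per-frame logits (batch, time, n_events)
        frame_logits = self.frame_scorer(frames)
        # Max-pool over time: the clip contains the event if at least one frame does.
        clip_logits, _ = frame_logits.max(dim=1)
        return clip_logits

# Training step against weak labels (1 = event present somewhere in the clip, 0 = absent).
model = WeakLabelPooler(n_features=64, n_events=10)
criterion = nn.BCEWithLogitsLoss()
frames = torch.randn(8, 500, 64)                      # hypothetical batch of feature frames
weak_labels = torch.randint(0, 2, (8, 10)).float()    # clip-level presence/absence targets
loss = criterion(model(frames), weak_labels)
loss.backward()
```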
Accurate labeling of large datasets is time-consuming, costly, and sensitive to human judgment, and thus requires trained annotators with domain knowledge. Unfortunately, large training datasets almost always contain examples with inaccurate or imprecise labels, also called “noisy” labels. Algorithms trained on noisy labels typically underperform in detection tasks. In addition to label noise, which comprises inaccuracies in label identity and quality, there are often additional sources of noise that degrade model performance. These stem from poor signal quality, which can result from many factors (e.g., coding, channel, and transmission errors). In the presence of weak labels and signal noise, we encounter a paradoxical situation in which providing more data to deep networks or other data-driven learning frameworks actually degrades performance and introduces biases tantamount to memorizing training label noise patterns.
One of the solutions explored in this thesis is to strengthen the available annotations. However, it is equally, if not more, important to develop techniques that impose less stringent labeling requirements on training data while still allowing the models to learn well. For the most part, we focus on the latter in our work. Our specific strategies include the following:
- Studying the effects of label noise, label corruption, and label density on learning with weak labels.
- Building a novel co-training approach to learn sound events from web (internet) data without any human labeling; we show that this approach yields significant improvements over models trained directly on weakly labeled data.
- Developing strategies to exploit additional cues that add negligible annotation or computational overhead, such as counts of sound events, proportions of sound events (in terms of occurrence), and durations of sound events. We refer to such cue-strengthened labels as semi-weak labels (see the count-based sketch following this list).
- Developing strategies based on the notion of negatives for sound labels (weak or strong), including exploiting all available information in the recording to complement or contrast the target label.
- Improving audio segmentation using imprecise labels.
- Examining the importance of the proximity of counts in a bag for semi-weak label learning: by controlled factoring of the distribution of weak-label counts across recordings, we examine how such factoring influences learning.
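As an illustration of the semi-weak setting referenced above, the following sketch shows one plausible way an annotated event count could supervise frame-level predictions: per-frame probabilities are summed and scaled by an assumed nominal event duration to form an expected count, which is penalized against the annotated count. This is a hypothetical surrogate objective for illustration only; the function name, the duration constant, and the loss choice are assumptions, not the thesis's actual formulation.

```python
import torch
import torch.nn.functional as F

def semi_weak_count_loss(frame_logits: torch.Tensor,
                         event_counts: torch.Tensor) -> torch.Tensor:
    """Hypothetical count-supervised objective for semi-weak labels.

    frame_logits: (batch, time, n_events) per-frame event logits.
    event_counts: (batch, n_events) annotated number of occurrences per clip.
    """
    frame_probs = torch.sigmoid(frame_logits)
    nominal_event_frames = 10.0                        # assumed average event length in frames
    # Expected count per clip: total per-frame activation divided by nominal event length.
    expected_counts = frame_probs.sum(dim=1) / nominal_event_frames
    return F.smooth_l1_loss(expected_counts, event_counts)

# Usage with hypothetical frame-level logits and annotated counts.
logits = torch.randn(4, 500, 10)
counts = torch.full((4, 10), 2.0)                      # e.g., each event heard twice per clip
loss = semi_weak_count_loss(logits, counts)
```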
As a final major advancement in this thesis, we also address the problem of learning without labels. We devise a novel method to perform unsupervised adaptation on multimedia datasets as a partial solution to this problem. The proposed strategy also generates new labels for the data, with which we are creating what we believe is currently the largest dataset of sound events whose labels have been computed in a completely unsupervised fashion.
Furthermore, we explore the feasibility of a unified framework that encapsulates all of the aforementioned approaches: learning with weak labels, learning without labels, combining weak and strong labels, using weak labels with additional cues (semi-weak labels), learning in the presence of noise, and so on. In this unified framework, each of these strategies becomes a special case that can be invoked as necessary during modeling.
This thesis establishes a robust and coherent foundation for future research in computational audition, offering innovative approaches for modeling and understanding the world of sounds under varying degrees of label precision, label quantity, and data quality.
Date
- 2024-10-01

Degree Type
- Dissertation

Department
- Language Technologies Institute

Degree Name
- Doctor of Philosophy (PhD)