Carnegie Mellon University
junchenl_PhD_LTI_2023.pdf (26.72 MB)

Towards Robust Large-scale Audio/Visual Learning

posted on 2023-08-25, 18:59, authored by Juncheng Li

Audio-visual event detection has benefited greatly from the advancement of deep learning in the past few years. Various model architectures have been applied to the task across multiple modalities, pushing performance benchmarks and enabling the deployment of such models in many critical applications such as surveillance and malicious content filtering. However, the research community still lacks: 1) a systematic understanding of how different machine learning models behave given the unique nature of audio signals compared to their image or text counterparts, and 2) a thorough study of the robustness of the models used for audio-visual learning, which remains an under-studied area.

The first goal of this thesis is to investigate best practices for building audio-only and audio-visual learning systems that perform well. Specifically, we analyze the features, compare different architectures, and examine differences in training techniques to provide a comprehensive understanding. Our investigation traces the evolution of models from the convolutional family to the Transformer family, and the transition in learning paradigms from supervised learning to self-supervised learning. (This part is elaborated in Chapters 2-7.)

The second goal is to study the robustness of each model by gauging its behavior under noise and adversarial perturbation. We first demonstrate the existence of real-world threats posed by adversarial perturbation in both the visual and audio domains. Following this, we broaden our adversarial robustness analysis beyond unimodal audio input to a range of modalities including audio, video, imagery, and text. (This part is covered in Chapters 8-11.) Further, we extend our research purview to a comparative study between adversarial robustness and noise robustness (Chapter 12). Aiming to fulfill the promise of both generalization and robustness in audio-visual learning, we present our Audio-Journey diffusion system. We utilize the diffusion model as an effective data augmentation instrument, adding semantically diverse samples to enhance performance and demonstrating its potential for generalization. Additionally, we take advantage of the diffusion model's innate denoising capabilities, suggesting that it could readily enhance the robustness of existing audio classification systems. (Chapter 13)




Degree Type

  • Dissertation


Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)


Advisor(s)

  • Florian Metze
