Towards Robust Large-scale Audio/Visual Learning

Li, Juncheng

doi:10.1184/R1/23995602.v1

junchenl_PhD_LTI_2023.pdf (26.72 MB)

thesis

posted on 2023-08-25, 18:59 authored by Juncheng Li

Audio-visual event detection has benefited greatly from the advancement of deep learning in the past few years. Various model architectures have been applied to the task in multiple modalities, pushing the performance benchmark and enabling the deployment of such models in many critical tasks such as surveillance and malicious content filtering. However, the research community still lacks a systematic understanding of how different machine learning models behave given the unique nature of audio signals compared with their image or text counterparts. The robustness of the models used for audio-visual learning also remains under-studied.

The first goal of this thesis is to investigate best practices for building audio-only and audio-visual learning systems that perform well. Specifically, we analyze the features, compare different architectures, and examine the differences between training techniques to provide a comprehensive picture. Our investigation traces the evolution of models from the convolutional family to the Transformer family, and the transition in learning paradigms from supervised learning to self-supervised learning. (This part is elaborated in Chapters 2 through 7.)

The second goal is to study the robustness of each model by gauging its behavior under noise and adversarial perturbation. We first demonstrate the existence of real-world threats caused by adversarial perturbation in both the visual and audio domains. We then broaden our adversarial robustness analysis beyond unimodal audio input to include multiple modalities such as audio, video, imagery, and text (Chapters 8 through 11). We further extend the study to compare adversarial robustness with noise robustness (Chapter 12). Finally, aiming to fulfill the promise of both generalization and robustness in audio-visual learning, we present our audio-journey diffusion system. We use the diffusion model as an effective data augmentation instrument that adds semantically diverse samples to enhance performance, demonstrating potential for generalization. We also take advantage of the diffusion model's innate denoising capabilities, which suggests that it could readily enhance the robustness of existing audio classification systems (Chapter 13).

History

Date

2023-07-31

Degree Type

Dissertation

Department

Language Technologies Institute

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Florian Metze

Keywords

Audio/Visual Machine Learning Robustness; audio event detection; Adversarial Robustness

Licence

In Copyright
