Posted on 2025-07-23, 19:24. Authored by Anurag Kumar.
One of the desiderata in machine intelligence is that machines must be able to comprehend sounds as humans do. They must know about various sounds, associate them with physical objects, entities, or events, be able to recognize and categorize them, and know or be able to discover relationships between them. We call this intelligence Acoustic Intelligence.

Automated machine understanding of sounds in specific forms such as speech and music has become relatively advanced and has been successfully deployed in systems that are now part of our daily life. However, the same cannot be said about other naturally occurring sounds in our environment. Speech production is constrained by our vocal cords, restricting, in certain senses, the input space over which machines need to work for automated understanding. In contrast, naturally occurring sounds are entirely unrestricted. The problem is exacerbated by the sheer number of sound types, their diversity and variability, and the variations in their structure and even in their interpretation.

We formalize acoustic intelligence in machines as consisting of two main problems: the first, in which we aim to acquire commonsense knowledge about sounds, and the second, in which we consider the problem of recognizing their presence in audio recordings. The first requires natural language understanding of sounds; the second is about large-scale recognition and detection of sound events and scenes. On the natural language understanding front, we develop methods to identify and catalog audible phrases and then to extract higher-level semantic relationships for sounds using these audible phrases (see the first sketch below). To the best of our knowledge, this is the first work to extract sound-related knowledge from textual data.

On the sound event recognition front, we address the primary barrier of the lack of labeled data for sounds. We propose to learn from weakly labeled data and show for the first time that audio event detectors can be trained using weakly labeled data. We formulate the problem through multiple instance learning and describe several methods under this framework for weakly supervised audio event detection. We then give deep learning methods for weakly labeled audio event detection as well, leading to state-of-the-art performance on several datasets (see the second sketch below). We show that these deep learning methods can further be employed in transfer learning for sounds. Finally, on the weak-label front, we also propose a unified learning framework that leverages both strongly and weakly labeled data at the same time.

The above methods address label challenges in the learning phase. We attempt to address these challenges during the evaluation phase as well: evaluating a trained model on a large-scale test set once again requires data labeling. We describe methods to precisely estimate the performance of a trained model under a restricted labeling budget (see the third sketch below).
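To give a flavor of the audible-phrase cataloging step, here is a minimal sketch in Python. The pattern templates (`sound of X`, `noise of X`, `X-ing sound`) and the helper `extract_audible_phrases` are illustrative assumptions for this page, not the thesis's actual rule set.

```python
import re

# Hypothetical pattern templates for mining audible phrases from raw text;
# the thesis's real extraction rules may differ.
PATTERNS = [
    re.compile(r"\bsound of (\w+)", re.IGNORECASE),
    re.compile(r"\bnoise of (\w+)", re.IGNORECASE),
    re.compile(r"\b(\w+ing) sound\b", re.IGNORECASE),
]

def extract_audible_phrases(text):
    """Return candidate audible phrases matched by any template."""
    phrases = set()
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            phrases.add(match.group(1).lower())
    return phrases

print(extract_audible_phrases(
    "The sound of rain mixed with a barking sound and the noise of traffic."
))
# -> {'rain', 'barking', 'traffic'} (set order may vary)
```

Pattern mining like this is inherently noisy at scale; the sketch only shows the cataloging idea on which higher-level relation extraction can then build.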
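To make the multiple instance learning (MIL) formulation concrete, the following sketch treats a recording as a bag of segment instances: a network scores each segment, the segment scores are max-pooled into a clip-level score, and training needs only clip-level weak labels. The use of PyTorch, the architecture, and the sizes are assumptions for illustration, not the thesis's exact models; the combined strong-plus-weak objective at the end, with its hypothetical weight `alpha`, is likewise only in the spirit of the unified framework described above.

```python
import torch
import torch.nn as nn

class WeakLabelDetector(nn.Module):
    """Scores each segment (instance) of a clip (bag) independently."""
    def __init__(self, feat_dim=64, num_events=10):
        super().__init__()
        self.segment_scorer = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, num_events),
        )

    def forward(self, segments):                    # (batch, n_segments, feat_dim)
        seg_logits = self.segment_scorer(segments)  # per-segment event scores
        # Max pooling encodes the MIL assumption: a bag is positive
        # if at least one of its instances is positive.
        clip_logits, _ = seg_logits.max(dim=1)      # (batch, num_events)
        return clip_logits, seg_logits

model = WeakLabelDetector()
criterion = nn.BCEWithLogitsLoss()                  # multi-label supervision
features = torch.randn(8, 20, 64)                   # 8 clips, 20 segments each
weak_labels = torch.randint(0, 2, (8, 10)).float()  # clip-level labels only

clip_logits, seg_logits = model(features)
weak_loss = criterion(clip_logits, weak_labels)

# Unified strong + weak objective (illustrative): when segment-level
# (strong) targets exist for some data, both losses can be combined.
strong_targets = torch.randint(0, 2, (8, 20, 10)).float()
alpha = 0.5                                         # hypothetical trade-off
total_loss = alpha * criterion(seg_logits, strong_targets) + (1 - alpha) * weak_loss
total_loss.backward()
# At test time, seg_logits provide temporal localization even though
# training used clip-level labels (plus any available strong labels).
```

The max pooling is one of several pooling choices; it most directly mirrors the MIL assumption that a weakly labeled clip contains at least one positive segment for each tagged event.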
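For the evaluation-phase problem, here is a baseline illustration of the setting, not the thesis's proposed estimators: with a labeling budget of a few hundred items, a model's accuracy on a large test set can be estimated from a random sample, with a normal-approximation confidence interval. All data below are synthetic; the thesis's methods aim to estimate performance more precisely than this naive sampling.

```python
import math
import random

random.seed(0)

# Synthetic stand-ins: model predictions and (hidden) ground truth for a
# large unlabeled test set. In practice the truth is unknown until labeled.
N = 100_000
predictions = [random.random() > 0.5 for _ in range(N)]
truth = [random.random() > 0.45 for _ in range(N)]

budget = 500                                   # items we can afford to label
sample = random.sample(range(N), budget)

correct = sum(predictions[i] == truth[i] for i in sample)
acc_hat = correct / budget
# 95% normal-approximation interval for the accuracy estimate.
half_width = 1.96 * math.sqrt(acc_hat * (1 - acc_hat) / budget)
print(f"estimated accuracy: {acc_hat:.3f} +/- {half_width:.3f}")
```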