
Unsupervised Learning of the 4D Audio-Visual World from Sparse Unconstrained Real-World Samples

posted on 23.04.2021, 14:46 by Aayush Bansal
We, humans, can easily observe, explore, and analyze our four-dimensional (4D) audio-visual world. We, however, struggle to share our observation, exploration, and analysis with others. In this thesis, our goal is to learn a computational representation of the 4D audio-visual world that can be: (1) estimated from sparse real-world observations; and (2) explored to create new experiences. We introduce Computational Studio for observing, exploring, and creating the 4D audio-visual world, thereby allowing humans to communicate with other humans and machines effectively without any loss of information. Computational Studio serves as an environment for non-experts to construct and creatively edit the 4D audio-visual world from sparse real-world samples. There are three essential components of the Computational Studio: (1) How can we densely observe the 4D visual world? (2) How can we communicate the audio-visual world using examples? and (3) How can we interactively explore the audio-visual world?
The first part introduces capturing, browsing, and reconstructing the 4D visual world from sparse real-world multi-view samples. We bring together insights from classical image-based rendering and neural rendering approaches. Crucial to our work are two components: (1) Fusing information from sparse multi-views to create dense 3D point clouds; and (2) Fusing multi-view information to create new views. Though captured from discrete viewpoints, the proposed formulation allows us to do dense 3D reconstruction and 4D visualization of dynamic events. It also enables us to move around the space-time of the event continuously and facilitates: (1) Freezing the time and exploring 3D space; (2) Freezing the 3D space and moving through time; and (3) Simultaneously changing both time and 3D space. Without any external information, our formulation allows us to get a dense depth map and a foreground-background segmentation, which enables us to efficiently track objects in a video. In turn, these properties allow us to edit the videos and reveal occluded things in a 3D space, provided they are visible in at least one view.
The second part details the example-based synthesis of the audio-visual world in an unsupervised manner. Example-based audio-visual synthesis allows us to express ourselves easily. In this part, we introduce Recycle-GAN, which combines spatial and temporal information via adversarial losses for unsupervised video retargeting. This representation allows us to translate the contents from one domain to another while preserving the style native to the target domain. For example, if our goal is to transfer the contents of John Oliver’s speech to Stephen Colbert, then the generated content/speech should be in Stephen Colbert’s own style. We then extend our work to audio-visual synthesis using Exemplar Autoencoders. Our approach builds on simple autoencoders that project out-of-sample data onto the distribution of the training set. We use Exemplar Autoencoders to learn the voice, stylistic prosody (emotions and ambiance), and visual appearance of a specific target exemplar speech. This work enables us to synthesize a natural voice for speech-impaired individuals and to perform zero-shot multilingual translation. Finally, we introduce PixelNN, a semi-parametric model that enables us to generate multiple outputs from a given input and examples.
The third part introduces human-controllable representations that allow a human user to interact with visual data and create new experiences on everyday computational devices. First, we introduce OpenShapes, which allows a user to interactively synthesize new images using a paint-brush and a drag-and-drop tool. OpenShapes runs on a single-core CPU to create multiple pictures from a user-generated label map. We then present simple video-specific autoencoders that enable human-controllable video exploration. This exploration includes a wide variety of video-analytic tasks such as (but not limited to) spatial and temporal super-resolution, object removal, video textures, average video exploration, associating various videos, video retargeting, and correspondence estimation within and across videos. Prior work has independently looked at each of these problems and proposed different formulations. We observe that a simple autoencoder trained (from scratch) on multiple frames of a specific video enables one to perform a large variety of video processing and editing tasks without being optimized for any single task. Finally, we present a framework that allows us to extract a wide range of low-, mid-, and high-level semantic and geometric scene cues that could be understood and expressed by both humans and machines.
We follow the concept of exemplar and test-time training for various formulations proposed in this thesis. This unique combination allows us to continually learn the audio-visual world in a streaming manner. The last part of this thesis extends our work on continual and streaming learning of the audio-visual world to learning visual-recognition tasks given a few labeled examples and a (potentially) infinite stream of unlabeled examples. Our approach continually improves a task-specific representation without any task-specific knowledge. We construct a schedule of learning updates that iterates between pre-training on novel segments of the stream and fine-tuning on the small and fixed labeled dataset. Contrary to popular approaches that use massive computing resources for storing and processing data, streaming learning requires modest computational infrastructure since it naturally breaks up massive datasets into slices that are manageable for processing. Streaming learning can help democratize research and development for scalable and lifelong ML.
Computational Studio is a first step towards unlocking the full degree of creative imagination, which is currently limited to the human mind by the limits of the individual’s expressivity and skills. It has the potential to change the way we audio-visually communicate with other humans and machines.




Degree Type



Robotics Institute

Degree Name

  • Doctor of Philosophy (PhD)


Deva K. Ramanan, Yaser A. Sheikh
