Amodal Visual Scene Representations With and Without Geometry
Most computer vision models in deployment today describe the pixels of images. This does not suffice, because images are only projections of the scene in front of the camera. In this thesis we build representations that attempt to describe the scene itself. We call these representations “amodal” (i.e., without modality), emphasizing the fact that they describe elements of the scene for which we have no sensory input. We present two methods for amodal visual scene representation. The first focuses on modelling space, and proposes geometry-based methods for lifting images into 3D maps, where the objects are complete, despite partial occlusions in the imagery. We show that this representation allows for self-supervised learning from multi-view data, and yields state-of-the-art results as a perception system for autonomous vehicles, where the goal is to estimate a “bird’s eye view” semantic map from multiple sensors. The second method focuses on modelling time, and proposes geometry-free methods for tracking image elements through partial and full occlusions across a video. Using learned temporal priors and within-inference optimization, we show that our model can track points through occlusions, outperforming flow-based and feature-matching methods on fine-grained multi-frame correspondence tasks.
- Doctor of Philosophy (PhD)