Visual Representation and Recognition without Human Supervision

Purushwalkam Sh, Senthil

doi:10.1184/R1/19943162.v1

thesis

posted on 2022-06-09, 20:25 authored by Senthil Purushwalkam Sh

The advent of deep-learning-based artificial perception models has revolutionized the field of computer vision. These methods take advantage of the ever-growing computational capacity of machines and the abundance of human-annotated data to build supervised learners for a wide range of visual tasks. However, the reliance on human-annotated data is also a bottleneck for the scalability and generalizability of these methods. We argue that in order to build more general learners (akin to an infant), it is crucial to develop methods that learn without human supervision. In this thesis, we present our research on minimizing the role of human supervision for two key problems: Representation and Recognition.

Recent self-supervised representation learning (SSL) methods have demonstrated impressive generalization capabilities on numerous downstream tasks. In this thesis, we investigate these approaches and demonstrate that they still rely heavily on the availability of clean, curated and structured datasets. We experimentally demonstrate that these learning capabilities fail to extend to data collected “in-the-wild” and hence expose the need for better benchmarks in self-supervised learning. We also propose novel SSL approaches that minimize this dependence on curated data.

Since exhaustively collecting annotations for all visual concepts is infeasible, methods that generalize beyond the available supervision are crucial for building scalable recognition models. We present a novel neural network architecture that takes advantage of the compositional nature of visual concepts to construct image classifiers for unseen concepts. For domains where collecting dense annotations is infeasible, we present an “understanding via associations” paradigm, which reformulates the recognition problem as the identification of correspondences. We apply this to videos and show that we can densely describe videos by identifying dense spatiotemporal correspondences to other similar videos. Finally, to explore the human ability to generalize beyond semantic categories, we introduce the “Functional Correspondence Problem” and demonstrate that representations that encode functional properties of objects can be used to recognize novel objects more efficiently.

History

Date

2022-05-10

Degree Type

Dissertation

Department

Robotics Institute

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Abhinav Gupta

Keywords

recognition, representation, self-supervised, zero-shot, invariances, Computer Vision

Licence

CC BY 4.0
