spurushw_phd_robotics_2022.pdf (24.53 MB)

Visual Representation and Recognition without Human Supervision

Posted on 09.06.2022, 20:25, authored by Senthil Purushwalkam Sh

The advent of deep-learning-based artificial perception models has revolutionized the field of computer vision. These methods take advantage of the ever-growing computational capacity of machines and the abundance of human-annotated data to build supervised learners for a wide range of visual tasks. However, the reliance on human-annotated data is also a bottleneck for the scalability and generalizability of these methods. We argue that, in order to build more general learners (akin to an infant), it is crucial to develop methods that learn without human supervision. In this thesis, we present our research on minimizing the role of human supervision for two key problems: Representation and Recognition.

Recent self-supervised representation learning (SSL) methods have demonstrated impressive generalization capabilities on numerous downstream tasks. In this thesis, we investigate these approaches and demonstrate that they still rely heavily on the availability of clean, curated, and structured datasets. We experimentally demonstrate that these learning capabilities fail to extend to data collected “in the wild” and hence expose the need for better benchmarks in self-supervised learning. We also propose novel SSL approaches that minimize this dependence on curated data.

Since exhaustively collecting annotations for all visual concepts is infeasible, methods that generalize beyond the available supervision are crucial for building scalable recognition models. We present a novel neural network architecture that takes advantage of the compositional nature of visual concepts to construct image classifiers for unseen concepts. For domains where collecting dense annotations is infeasible, we present an “understanding via associations” paradigm, which reformulates the recognition problem as the identification of correspondences. We apply this to videos and show that we can densely describe videos by identifying dense spatiotemporal correspondences to other similar videos. Finally, to explore the human ability to generalize beyond semantic categories, we introduce the “Functional Correspondence Problem” and demonstrate that representations encoding functional properties of objects can be used to recognize novel objects more efficiently.




Degree Type

Department

  • Robotics Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

  • Abhinav Gupta
