Carnegie Mellon University
sbahl2_phd_ri_2024.pdf (42.85 MB)

Watch, Practice, Improve: Towards In-the-wild Manipulation

posted on 2024-03-07, 16:21 authored by Shikhar Bahl

The longstanding dream of many roboticists is to see robots perform diverse tasks in diverse environments. To build such a robot that can operate anywhere, many methods train on robotic interaction data. While these approaches have led to significant advances, they rely on heavily engineered setups or high amounts of supervision, neither of which is scalable. How can we move towards training robots that operate autonomously, in the wild? Unlike computer vision and natural language processing, in which a staggering amount of data is available on the internet, robotics faces a chicken-and-egg problem: to train robots to work in diverse scenarios, we need a large amount of robot data from diverse environments, but to collect this kind of data, we need robots to be deployed widely, which is feasible only if they are already proficient. How can we break this deadlock?

The proposed solution, and the goal of my thesis, is to use an omnipresent source of rich interaction data: humans. Fortunately, there are plenty of real-world human interaction videos on the internet, which can help bootstrap robot learning by side-stepping the expensive aspects of the data collection-training loop. To this end, we aim to learn manipulation from watching humans perform various tasks. We circumvent the embodiment gap by imitating the effect the human has on the environment, instead of the exact actions. We obtain interaction priors, and subsequently practice directly in the real world to improve. To move beyond explicit human supervision, the second work in the thesis aims to predict robot-centric visual affordances: where to interact and how to move post-interaction, directly from offline human video datasets. We show that this model can be seamlessly integrated into any robot learning paradigm. The third part of the thesis focuses on how to build general-purpose policies by leveraging human data. We show that world models are strong mechanisms for sharing representations across human and robot data coming from many different environments. We use a structured affordance-based action space to train multitask policies and show that this greatly boosts performance. In the fourth work of the thesis, we investigate how to use human data to build actionable representations for control. Our key insight is to move beyond traditional training of visual encoders and use human actions and affordances to improve the model. We find that this approach can improve real-world imitation learning performance for almost any pre-trained model, across multiple challenging tasks. Finally, visual affordances may struggle to capture complex action spaces, especially in high-degree-of-freedom robots such as dexterous hands. Thus, in the final works of the thesis, we explore how to learn more explicit, physically grounded action priors from human videos, mainly in the context of dexterous manipulation.




Degree Type

  • Dissertation


Department

  • Robotics Institute

Degree Name

  • Doctor of Philosophy (PhD)


Advisor(s)

  • Deepak Pathak
  • Abhinav Gupta
