Carnegie Mellon University
Perception amidst interaction: spatial AI with vision and touch for robot manipulation

posted on 2024-03-07, 17:06 authored by Sudharshan Suresh

 Robots currently lack the cognition to replicate even a fraction of the tasks humans do, a trend summarized by Moravec’s Paradox. Humans effortlessly combine their senses for everyday interactions—we can rummage through our pockets in search of our keys, and deftly insert them to unlock our front door. Before robots can demonstrate such dexterity, they must first exhibit spatial awareness of the objects they manipulate. Specifically, object pose and shape are important quantities for downstream planning and control. The status quo for in-hand perception is restricted to the narrow scope of tracking known objects with vision as the dominant modality. As robots move out of instrumented labs and factories to cohabit our spaces, it is clear that a missing piece is generalizable spatial AI. 

Often overlooked is tactile sensing, which provides a direct window into robot-object interaction, free from occlusion and aliasing. With hardware advances like vision-based touch, we now have situated yet detailed information to complement cameras. However, interactive perception is intrusive—the act of sensing itself perturbs the object. Can we robustly estimate object shape and pose online from a stream of multimodal robot manipulation data? 

In this thesis, I study the intersection of simultaneous localization and mapping (SLAM) and robot manipulation. More specifically, I look at: (1) spatial representations for object-centric SLAM, (2) tactile perception and simulation, and (3) combining learned models with online optimization. First, I show how factor graphs fuse touch with physics-based constraints for SLAM in planar manipulation (Chapter 2). Next, I present a schema for online shape learning from visuo-tactile sensing (Chapter 3). I then demonstrate a learned tactile representation for global localization via touch (Chapter 4). Drawing upon these efforts, I conclude by unifying vision, touch, and proprioception into a neural representation for SLAM during in-hand manipulation (Chapter 5).




Degree Type

  • Dissertation


  • Robotics Institute

Degree Name

  • Doctor of Philosophy (PhD)


Michael Kaess
