Pose Machines: Estimating Articulated Pose in Images and Video
The articulated motion of humans is varied and complex. We use the range of motion of our articulated structure for functional tasks such as transport, manipulation, communication, and self-expression, and we use our limbs to gesture and signal intent. It is therefore crucial for an autonomous system that operates and interacts in human environments to be able to reason about human behavior in natural, unconstrained settings. This requires reliably extracting compact representations of behavior from possibly noisy sensors in a computationally efficient manner. The goal of this thesis is to develop computational methods for extracting compact keypoint representations of human pose from unconstrained, uncontrolled real-world images and video. Estimating articulated human pose from such images is extremely challenging because of the large number of kinematic degrees of freedom of the human body, large variability in appearance and viewpoint, imaging artifacts, and the inherent ambiguity of reasoning about three-dimensional objects from two-dimensional images. A core characteristic of the problem is the trade-off between the complexity of the human pose model and the tractability of drawing inferences from it: as we increase model fidelity, either by incorporating structural and physical constraints or by making fewer limiting assumptions, searching for the optimal pose configuration becomes increasingly intractable. With this trade-off in mind, this thesis develops methods for reasoning about articulated human pose from single images, built on a modular sequential prediction framework called a Pose Machine. Pose Machines reduce the structured prediction problem of articulated pose estimation to supervised multi-class classification.
The modular framework allows for integrating the latest advances in supervised prediction, incorporates informative cues across multiple resolutions, learns rich implicit spatial models by making fewer limiting assumptions, handles large real-world datasets, and can be trained in an end-to-end manner. Additionally, we develop methods for estimating pose from image sequences and for reconstructing pose in three dimensions by finding substructures that allow physical and structural constraints to be incorporated while inference remains tractable.
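To make the reduction to multi-class classification concrete, the following is a minimal sketch of sequential prediction, not the implementation developed in the thesis. It rests on several assumptions made only for illustration: each stage is a plain softmax regression, context is the average of the previous stage's beliefs over a 3x3 neighbourhood, and every stage is trained on the same data it later predicts on. All function names (fit_softmax, context_features, train_pose_machine, predict) are hypothetical, and the multi-resolution cues mentioned above are omitted.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax(X, y, n_classes, lr=0.5, steps=200):
    """Multi-class logistic regression fitted by batch gradient descent.
    Stands in for any supervised multi-class classifier (an assumption)."""
    W = np.zeros((X.shape[1], n_classes))
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(steps):
        W -= lr * X.T @ (softmax(X @ W) - Y) / len(X)
    return W

def context_features(beliefs):
    """Average the (H, W, C) belief maps over each 3x3 neighbourhood.
    A deliberately simple, hypothetical choice of context feature."""
    H, W, _ = beliefs.shape
    padded = np.pad(beliefs, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(beliefs)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + H, dx:dx + W]
    return out / 9.0

def stage_inputs(feats, beliefs):
    """Per-location inputs: image features, a bias term, and (after the
    first stage) context computed from the previous stage's beliefs."""
    H, W, _ = feats.shape
    parts = [feats, np.ones((H, W, 1))]
    if beliefs is not None:
        parts.append(context_features(beliefs))
    return np.concatenate(parts, axis=2).reshape(H * W, -1)

def run_stage(W_stage, feats, beliefs):
    """Apply one stage's classifier at every location of one image."""
    H, W, _ = feats.shape
    scores = stage_inputs(feats, beliefs) @ W_stage
    return softmax(scores).reshape(H, W, -1)

def train_pose_machine(images, labels, n_classes, n_stages=3):
    """Train stages one after another; each sees the previous beliefs.
    images: list of (H, W, D) feature arrays.
    labels: list of (H, W) integer arrays (part index or background)."""
    stages = []
    beliefs = [None] * len(images)
    y = np.concatenate([l.ravel() for l in labels])
    for _ in range(n_stages):
        X = np.vstack([stage_inputs(f, b) for f, b in zip(images, beliefs)])
        W_stage = fit_softmax(X, y, n_classes)
        stages.append(W_stage)
        beliefs = [run_stage(W_stage, f, b) for f, b in zip(images, beliefs)]
    return stages

def predict(stages, feats):
    """Run the stages in sequence and return the final (H, W, C) beliefs."""
    beliefs = None
    for W_stage in stages:
        beliefs = run_stage(W_stage, feats, beliefs)
    return beliefs
```

Given per-location image features and labels, train_pose_machine(images, labels, n_classes) returns one weight matrix per stage, and predict(stages, feats) returns a belief map whose last axis sums to one at every location. Because each stage is just a classifier over local features plus context, any stronger supervised predictor could be substituted for fit_softmax without changing the surrounding structure.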