Unified Simulation, Perception, and Generation of Human Behavior
Understanding and modeling human behavior is fundamental to almost any computer vision or robotics application that involves humans. In this thesis, we take a holistic approach to human behavior modeling and tackle its three essential aspects — simulation, perception, and generation. Throughout the thesis, we show how the three aspects are deeply connected and how utilizing and improving one aspect can greatly benefit the others.
As humans live in a physical world, we treat physics simulation as the foundation of our approach. In the first part of the thesis, we start by developing a robust framework for representing human behavior in physics simulation. In particular, we model a human using a proxy humanoid agent inside a physics simulator and treat human behavior as the result of an optimal control policy for the humanoid. This framework allows us to formulate human behavior modeling as policy learning, which can be solved with reinforcement learning (RL). Since it is difficult and often suboptimal to manually design simulated agents such as humanoids, we further propose a transform-and-control RL framework for efficient and automatic design of agents that are more performant than those created by experts.
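The formulation above — behavior as an optimal control policy for a simulated agent, learned by maximizing reward — can be illustrated with a deliberately minimal sketch. This is not the thesis implementation: the "agent" is a point mass rather than a humanoid, the physics step is trivial, and random-search policy improvement stands in for a full RL algorithm; all names here are hypothetical.

```python
import numpy as np

# Toy sketch of "behavior as a learned control policy": a point-mass agent
# must reach a target at the origin. The policy is linear (action = W @ s),
# and simple random-search hill climbing stands in for RL policy learning.

def rollout(W, steps=30):
    """Run the policy in a trivial simulator; reward = -final distance."""
    s = np.array([1.0, 1.0])          # state: position (target at origin)
    for _ in range(steps):
        a = np.clip(W @ s, -0.2, 0.2)  # bounded "actuator" forces
        s = s + a                      # one physics step
    return -np.linalg.norm(s)

def train(iters=200, seed=0):
    rng = np.random.default_rng(seed)
    W = np.zeros((2, 2))
    best = rollout(W)
    for _ in range(iters):
        cand = W + 0.1 * rng.standard_normal((2, 2))
        r = rollout(cand)
        if r > best:                   # keep perturbations that improve reward
            W, best = cand, r
    return W, best

W, reward = train()
print(round(reward, 3))
```

The same structure — simulate, score with a reward, improve the policy — carries over when the point mass is replaced by a physics-simulated humanoid and the hill climbing by reinforcement learning.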
In the second part of the thesis, we study the perception of human behavior through the lens of human pose estimation, where we utilize the simulation-based framework developed in the first part. Specifically, we learn a video-conditioned policy with RL using a reward function based on how well the policy-generated pose aligns with the ground truth. For both first-person and third-person human pose estimation, our simulation-based approach significantly outperforms kinematics-based methods in terms of pose accuracy and physical plausibility. The improvement is especially evident in the challenging first-person setting, where the front-facing camera cannot see the person. Besides using simulation, we also propose to use human behavior generation models for global occlusion-aware human pose estimation with dynamic cameras. Concretely, we use deep generative motion and trajectory models to hallucinate poses for occluded frames and generate consistent global trajectories from estimated body poses.

In the third part of the thesis, we focus on the generation of human behavior, leveraging our simulation-based framework and deep generative models. We first present a simulation-based generation approach that can generate a single future motion of a person from a first-person video. To address the uncertainty in future human behavior, we develop two deep generative
models that can generate diverse future human motions using determinantal point processes (DPPs) and latent normalizing flows respectively. Finally, extending from the single-agent setting, we further study multi-agent behavior generation where multiple humans interact with each other in complex social scenes. We develop a stochastic agent-aware trajectory generation framework that can forecast diverse and socially-aware human trajectories.
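The role of determinantal point processes in diverse generation can be conveyed with a small hedged sketch. This is not the thesis model: a DPP assigns a candidate set Y probability proportional to det(L_Y), which penalizes sets containing similar items; below, a toy greedy MAP selection picks a diverse subset of candidate motion embeddings under an assumed RBF similarity kernel.

```python
import numpy as np

# Sketch of DPP-style diversity: greedily select the subset of candidates
# whose kernel submatrix has the largest determinant, so near-duplicate
# candidates are avoided in favor of mutually dissimilar ones.

def greedy_dpp(L, k):
    """Greedily pick k items maximizing det of the kernel submatrix."""
    selected, remaining = [], list(range(len(L)))
    for _ in range(k):
        best_i, best_det = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            d = np.linalg.det(L[np.ix_(idx, idx)])
            if d > best_det:
                best_i, best_det = i, d
        selected.append(best_i)
        remaining.remove(best_i)
    return selected

# Toy "motion embeddings": items 0 and 1 are near-duplicates; 2 is distinct.
X = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
L = np.exp(-0.5 * np.sum((X[:, None] - X[None]) ** 2, axis=-1))  # RBF kernel
picks = greedy_dpp(L, 2)
print(sorted(picks))  # → [0, 2]: the duplicate (item 1) is skipped
```

Greedy selection is only an approximation to exact DPP sampling, but it captures the key property exploited for diverse motion generation: adding an item similar to one already chosen barely increases the determinant, so the selected futures spread out over distinct behaviors.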
- Doctor of Philosophy (PhD)