Learning Structured and Deep Representations for Traffic Scene Understanding
Recent advances in representation learning have led to an increasing variety of vision-based approaches to traffic scene understanding. These include general vision problems such as object detection, depth estimation, edge/boundary/contour detection, semantic segmentation and scene classification, as well as application-driven problems such as pedestrian detection, vehicle detection, lane marker detection and road segmentation. In this thesis, we approach some of these problems by exploring structured and invariant representations of the visual input. Our research is mainly motivated by two facts: 1. Traffic scenes often contain highly structured layouts, so exploiting structured priors is expected to improve scene understanding performance considerably. 2. A major challenge of traffic scene understanding lies in the diverse and changing nature of the scene content. It is therefore important to find robust visual representations that are invariant to such variability. We start with highway scenarios, where we are interested in detecting hard road borders and estimating the drivable space in front of these physical boundaries. To this end, we treat the task as a joint detection and tracking problem and formulate it with structured Hough voting (SHV), a conditional random field model that exploits both intra-frame geometric and inter-frame temporal information to generate more accurate and stable predictions. Turning from highway scenes to urban scenes, we consider dense prediction problems such as category-aware semantic edge detection and semantic segmentation. Category-aware semantic edge detection is challenging because the model must jointly localize object contours and classify each edge pixel into one or more predefined classes. We propose CASENet, a multi-label deep network with state-of-the-art edge detection performance.
To address the label misalignment problem in edge learning, we also propose SEAL, a framework for simultaneous edge alignment and learning. Failure to generalize across domains has been a common bottleneck of semantic segmentation methods. In this thesis, we address the problem of adapting a segmentation model trained on a source domain to a different target domain without access to target domain labels, and propose a class-balanced self-training approach for such unsupervised domain adaptation. We adopt the "synthetic-to-real" setting, where a model is pre-trained on GTA-5 and adapted to real-world datasets such as Cityscapes and Nexar, as well as the "cross-city" setting, where a model is pre-trained on Cityscapes and adapted to unseen data from Rio, Tokyo, Rome and Taipei. Experiments show that our method outperforms state-of-the-art methods, such as adversarial-training-based domain adaptation.