Towards Modular and Differentiable Autonomous Driving
The classical “modular and cascaded” autonomy stack (object detection, then tracking, then trajectory prediction, then motion planning and control) has been widely used in industry for autonomous systems such as self-driving cars due to its interpretability and fast development cycle. In this thesis, we advocate the use of such a modular stack, but improve its accuracy and robustness by developing effective representations for each module and by tightly integrating the modules into a differentiable stack.
In the first part of this thesis, we focus on improving each individual module in the stack. Contributions include a pseudo-LiDAR representation that quadruples the performance of monocular 3D object detection and opens a new direction in the domain, an efficient 3D Kalman filter for multi-object tracking that achieves state-of-the-art accuracy at faster-than-real-time speed, a graph-based social-aware representation that improves discriminative feature learning for data association, and a joint social-temporal representation for trajectory prediction that leverages the transformer architecture to achieve state-of-the-art performance.
Building on this progress on the individual modules, we then focus on integrating them into an effective, robust, and differentiable stack, i.e., we consider multiple modules jointly during training and/or evaluation: (1) First, we integrate object detection and tracking by extending the graph-based social-aware representation to model object relations in both the detection and tracking settings, and by introducing an automatic, dynamic detection selection mechanism that better filters detections for downstream tracking. (2) Next, we introduce two frameworks to better integrate tracking and prediction: a parallelized tracking-and-prediction framework that alleviates compounding errors between the two modules, and a multi-hypothesis tracking-and-prediction framework that increases the robustness of prediction to inputs containing tracking errors. (3) Towards the full integration of detection, tracking, and prediction, we propose an affinity-based prediction framework that no longer uses the trajectory representation in the stack. Instead, it takes affinity matrices directly as inputs for prediction; these contain “soft” information about object identity and retain strictly more information than the trajectories obtained by data association in tracking. By removing the error-prone data association step, this framework further reduces error propagation.
Beyond the order of the classical perception-then-prediction stack, we then explore an inverted prediction-then-perception stack. With the order inverted, prediction is performed directly on input sensor data (e.g., point clouds), which requires no expensive labels for training and can potentially increase scalability by leveraging large-scale unlabelled sensor data. To tackle sequential point cloud forecasting, the first task in this stack, we first develop a deterministic LSTM autoencoder architecture as a proof of concept, which, however, cannot capture the inherent uncertainty of the future. We therefore propose a conditional variational recurrent neural network that accounts for future uncertainty and produces higher-fidelity predictions. Moreover, since predicting future sensor data requires the prediction model to understand world dynamics, we hypothesize that the learned predictive representation may benefit downstream motion planning. To validate this hypothesis, we integrate self-supervised point cloud prediction into an end-to-end driving policy for autonomous driving, which achieves state-of-the-art closed-loop performance in the CARLA simulator.
- Robotics Institute
- Doctor of Philosophy (PhD)