Towards Deployable Reinforcement Learning: Safety, Robustness, Adaptivity, and Scalability
The increasing demand to apply reinforcement learning (RL) in safety-critical domains accentuates the need for safe, robust, and versatile RL algorithms. This thesis addresses this imperative by introducing a suite of policy optimization algorithms that tackle the key challenges of safe RL, paving the way toward more reliable and practical deployments.
The first part of the thesis focuses on enhancing sample efficiency and training stability, both crucial for deployable safe RL. We propose the Constrained Variational Policy Optimization (CVPO) method, which reformulates the safe RL problem as a two-stage optimization process. This reformulation yields efficient and stable learning with strong performance guarantees, making CVPO well suited to practical safe RL applications in terms of both safety and sample efficiency.
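As background, the constrained MDP objective and the flavor of two-stage (E-step/M-step) decomposition that CVPO builds on can be sketched as follows. The notation here (variational distribution q, cost budget epsilon, KL radius delta) is standard rather than quoted from the thesis, and the exact update rules are given in the corresponding chapter.

```latex
% Constrained MDP objective: maximize return subject to a cost budget \epsilon
\begin{align*}
  \max_{\pi}\ & \mathbb{E}_{\tau\sim\pi}\Big[\textstyle\sum_{t}\gamma^{t} r(s_t,a_t)\Big]
  \quad\text{s.t.}\quad
  \mathbb{E}_{\tau\sim\pi}\Big[\textstyle\sum_{t}\gamma^{t} c(s_t,a_t)\Big]\le\epsilon .
\end{align*}
% Two-stage (EM-style) decomposition:
%   E-step: solve a constrained problem for a non-parametric distribution q
%   M-step: project q back onto the parametric policy class by supervised learning
\begin{align*}
  \text{E-step:}\quad & \max_{q}\ \mathbb{E}_{s,\,a\sim q}\big[Q_r(s,a)\big]
    \ \ \text{s.t.}\ \ \mathbb{E}_{s,\,a\sim q}\big[Q_c(s,a)\big]\le\epsilon,\ \
    \mathrm{KL}\big(q\,\Vert\,\pi_{\theta}\big)\le\delta, \\
  \text{M-step:}\quad & \min_{\theta}\ \mathbb{E}_{s}\Big[\mathrm{KL}\big(q\,\Vert\,\pi_{\theta}\big)\Big].
\end{align*}
```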
The second part of the thesis addresses robustness against observational perturbations, a critical property of deployable RL. We uncover the vulnerability of learned safe policies to stealthy perturbations that induce unsafe behavior. These findings underscore the need for adversarial training to preserve safety under adverse conditions. Building on them, we first introduce an on-policy adversarial training pipeline and then present SAFER, an off-policy method derived from CVPO that effectively improves policy robustness and safety in adversarial settings.
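The threat model can be summarized schematically: an adversary replaces the true state with a perturbed observation drawn from a small norm ball, and adversarial training optimizes the policy against the worst such perturbation. This is one generic way to write observational attacks and adversarial training for safe RL, not the thesis's exact objective or its specific adversaries.

```latex
% Observational adversary: the policy acts on a perturbed observation \nu(s)
% chosen from an \ell_\infty ball of radius \epsilon around the true state.
\begin{align*}
  \nu(s) \in B_{\epsilon}(s) = \{\tilde{s} : \lVert \tilde{s} - s \rVert_{\infty} \le \epsilon\},
  \qquad a \sim \pi\big(\cdot \mid \nu(s)\big).
\end{align*}
% Adversarial training then optimizes the policy against the worst-case perturbation,
% e.g. keeping the worst-case expected cost within the budget \epsilon_c:
\begin{align*}
  \max_{\pi}\ \min_{\nu}\ \mathbb{E}\Big[\textstyle\sum_t \gamma^t r(s_t,a_t)\Big]
  \quad\text{s.t.}\quad
  \max_{\nu}\ \mathbb{E}\Big[\textstyle\sum_t \gamma^t c(s_t,a_t)\Big]\le\epsilon_c .
\end{align*}
```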
Lastly, the thesis addresses the adaptivity and scalability of deployable RL by learning from static offline datasets. It introduces the Constrained Decision Transformer (CDT), a novel approach that uses sequence-modeling techniques to allow the trade-off between safety and task performance to be adjusted dynamically at deployment time. Alongside CDT, the thesis proposes TAIL, a scalable training paradigm for continual learning that efficiently adapts pretrained models to new tasks while mitigating catastrophic forgetting and overfitting.
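To illustrate the sequence-modeling idea behind CDT, the minimal sketch below conditions a policy on both a reward return-to-go and a cost return-to-go, so the safety/performance trade-off can be changed at deployment simply by changing the target tokens. The backbone (a small GRU), the token layout, and all dimensions are illustrative assumptions, not CDT's actual transformer architecture.

```python
import torch
import torch.nn as nn

class ReturnCostConditionedPolicy(nn.Module):
    """Sketch of a sequence-model policy conditioned on reward return-to-go and
    cost return-to-go, in the spirit of CDT. Architecture details are assumptions."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Embed (reward-to-go, cost-to-go, state) for each timestep into one token.
        self.embed = nn.Linear(2 + state_dim, hidden_dim)
        self.backbone = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, rtg, ctg, states):
        # rtg, ctg: (batch, T, 1); states: (batch, T, state_dim)
        tokens = self.embed(torch.cat([rtg, ctg, states], dim=-1))
        hidden, _ = self.backbone(tokens)
        return self.action_head(hidden)  # predicted action for each timestep

# At deployment, the trade-off is adjusted via the target tokens: a lower cost-to-go
# asks for more conservative behavior, a higher reward-to-go for more aggressive
# task performance.
policy = ReturnCostConditionedPolicy(state_dim=8, action_dim=2)
states = torch.randn(1, 10, 8)
aggressive = policy(torch.full((1, 10, 1), 300.0), torch.full((1, 10, 1), 40.0), states)
conservative = policy(torch.full((1, 10, 1), 200.0), torch.full((1, 10, 1), 5.0), states)
```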
In summary, this thesis is dedicated to pushing the boundaries of safe, robust, and scalable policy optimization, marking strides toward deployable RL in safety-critical areas. The proposed methods offer robust, efficient, and adaptable solutions, crucial for the real-world deployment of RL systems.
Date: 2024-01-10
Degree Type: Dissertation
Department: Mechanical Engineering
Degree Name: Doctor of Philosophy (PhD)