Provable Algorithms for Reinforcement Learning: Efficiency, Scalability, and Robustness
Reinforcement learning (RL), which strives to learn desirable sequential decisions through trial-and-error interactions with an unknown environment, has recently achieved remarkable success in a variety of domains, including games and large language model alignment. In the face of unknown environments of unprecedented dimensionality, making the best use of available samples inevitably lies at the core of RL, especially in ubiquitous sample-starved scenarios such as clinical trials and autonomous driving. To understand and tackle the challenges of sample efficiency, substantial progress has been made recently in developing a finite-sample theoretical framework for analyzing algorithms of interest and designing algorithms with provably optimal sample efficiency. Nevertheless, existing results still fall short with regard to statistical understanding and algorithmic optimality in a wide range of RL settings. Moreover, motivated by countless scenarios with large dimensionality or sim-to-real gaps, sample efficiency needs to be considered alongside scalability and robustness, two equally important principles in RL.
This thesis breaks down the sample barriers of various RL formulations while additionally taking scalability and robustness into account. Specifically, for online RL, which allows adaptive interactions with the environment, this thesis provides the first provably regret-optimal model-free RL algorithm with a small burn-in cost (the initial sampling burden needed before the algorithm exhibits the desired performance), while maintaining memory efficiency for scalability. For offline RL, which only has access to historical datasets, this thesis proposes the first provably near-optimal model-free offline RL algorithm that requires no model estimation, and settles the sample complexity of offline RL by establishing the minimax optimality of model-based algorithms without burn-in cost. Finally, for distributionally robust RL, a robust variant of standard RL, this thesis uncovers a surprising fact: promoting additional distributional robustness in the learned policy is neither necessarily harder nor necessarily easier than standard RL in terms of sample requirements; the answer depends heavily on the prescribed uncertainty set. The thesis closes by providing the first provably near-optimal algorithm for offline robust RL, which learns under simultaneous model uncertainty and limited historical data.
- Electrical and Computer Engineering
- Doctor of Philosophy (PhD)