Statistical Inference for Policy Evaluation in Reinforcement Learning
Policy evaluation plays a critical role in many scientific and engineering applications of Reinforcement Learning (RL), ranging from clinical trials to mobile health, robotics, and autonomous driving. Among the many RL algorithms used for policy evaluation, Temporal Difference (TD) learning and its variants are arguably the most popular. Despite the widespread use and practical significance of policy evaluation via TD learning, practitioners currently lack the statistical tools needed to support their decision-making. This thesis aims to address this gap by developing theory and methods for performing statistical inference on policy evaluation with TD learning estimators.
In the first part of the thesis, we derive novel, sharp non-asymptotic bounds on the estimation error of TD learning procedures with linear function approximation. Assuming independent samples, we formulate sharp sample complexity bounds for both averaged TD learning and two time-scale TD learning with gradient correction (TDC). In the on-policy setting, our results for averaged TD learning improve significantly over the previous state-of-the-art bounds, by a factor that can scale linearly with the dimension of the state space. In the off-policy setting, our upper bound for TDC is the first to deliver a minimax optimal scaling with respect to the tolerance level while exhibiting an explicit dependence on all the problem-related parameters.
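To make the estimator under study concrete, the sketch below implements averaged TD(0) with linear value function approximation under i.i.d. transitions. It is a minimal illustration rather than the thesis's exact procedure: the function name averaged_td, the polynomially decaying stepsize schedule eta0 / t**alpha, and the feature map phi are assumptions made for the example.

```python
import numpy as np

def averaged_td(transitions, phi, d, gamma=0.9, eta0=1.0, alpha=0.5):
    """Averaged TD(0) with linear approximation V(s) ~ phi(s) @ theta.

    transitions: iterable of i.i.d. (s, r, s_next) samples.
    phi: feature map from a state to an np.ndarray of length d.
    Returns the Polyak-Ruppert average of the TD iterates.
    """
    theta = np.zeros(d)      # current TD iterate
    theta_bar = np.zeros(d)  # running average of the iterates
    for t, (s, r, s_next) in enumerate(transitions, start=1):
        eta = eta0 / t**alpha               # decaying stepsize (illustrative choice)
        x, x_next = phi(s), phi(s_next)
        td_error = r + gamma * x_next @ theta - x @ theta
        theta = theta + eta * td_error * x  # TD(0) update
        theta_bar += (theta - theta_bar) / t  # online Polyak-Ruppert averaging
    return theta_bar
```

For instance, with the tabular feature map phi = lambda s: np.eye(n_states)[s], the routine reduces to averaged tabular TD(0), whose averaged iterate is the quantity the non-asymptotic bounds above control.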
In the second part of the thesis, we focus on the on-policy setting and develop valid and efficient inference procedures for TD learning-based estimators. We leverage novel finite-sample distributional approximations of TD estimators under different stepsize choices and with both i.i.d. and Markov samples. We establish state-of-the-art Berry-Esseen bounds that control the rate at which the TD estimation errors converge to their asymptotic distributions, and we formulate an online algorithm that uses these results to construct confidence intervals. We demonstrate the validity of the resulting confidence intervals for both independent samples and Markovian trajectories.
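As a hedged illustration of how a distributional approximation for the averaged iterate can be turned into confidence intervals online, the sketch below augments the averaged TD recursion with running estimates of A = E[x (x - gamma x')^T] and Gamma = E[delta^2 x x^T], and plugs them into the sandwich covariance A^{-1} Gamma A^{-T}. This plug-in construction is one standard route from a central limit theorem to coordinatewise intervals, not necessarily the specific online algorithm developed in the thesis; the function name and the i.i.d.-sampling assumption are illustrative.

```python
import numpy as np
from scipy import stats

def td_confidence_intervals(transitions, phi, d, gamma=0.9, eta0=1.0,
                            alpha=0.5, level=0.95):
    """Averaged TD(0) with plug-in confidence intervals per coordinate.

    Estimates the asymptotic covariance of sqrt(n) * (theta_bar - theta*)
    by the sandwich A^{-1} Gamma A^{-T}, with A and Gamma averaged online.
    """
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    A_hat = np.zeros((d, d))
    Gamma_hat = np.zeros((d, d))
    n = 0
    for t, (s, r, s_next) in enumerate(transitions, start=1):
        n = t
        eta = eta0 / t**alpha
        x, x_next = phi(s), phi(s_next)
        delta = r + gamma * x_next @ theta - x @ theta
        # Running averages of the design matrix A and noise moment Gamma;
        # delta is evaluated at the current iterate, a plug-in approximation.
        A_hat += (np.outer(x, x - gamma * x_next) - A_hat) / t
        Gamma_hat += (delta**2 * np.outer(x, x) - Gamma_hat) / t
        theta = theta + eta * delta * x
        theta_bar += (theta - theta_bar) / t
    # Assumes enough samples have accrued that A_hat is invertible.
    A_inv = np.linalg.inv(A_hat)
    Sigma = A_inv @ Gamma_hat @ A_inv.T
    z = stats.norm.ppf(0.5 + level / 2)
    half_width = z * np.sqrt(np.diag(Sigma) / n)
    return theta_bar, theta_bar - half_width, theta_bar + half_width
```

Because every quantity is updated with a constant-memory running average, the intervals can be refreshed after each new transition, which is the sense in which such a procedure operates online.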
Date
- 2024-08-01
Degree Type
- Dissertation
Department
- Statistics and Data Science
Degree Name
- Doctor of Philosophy (PhD)