Carnegie Mellon University

Statistical Inference for Policy Evaluation in Reinforcement Learning

thesis
posted on 2024-09-04, 21:43 authored by Weichen Wu

Policy evaluation plays a critical role in many scientific and engineering applications of Reinforcement Learning (RL), ranging from clinical trials to mobile health, robotics, and autonomous driving. Among the many RL algorithms used for policy evaluation, Temporal Difference (TD) learning and its variants are arguably the most popular. Despite the widespread use and practical significance of policy evaluation via TD learning, practitioners currently lack the statistical tools needed to support their decision-making. This thesis aims to address this issue by developing theory and methods for statistical inference for policy evaluation using TD learning estimators.

In the first part of the thesis, we derive novel and sharp non-asymptotic bounds on the estimation error of TD learning procedures with linear function approximation. Assuming independent samples, we formulate sharp sample complexity bounds for both averaged TD learning and two time-scale TD learning with gradient correction. In the on-policy setting, our results for averaged TD learning improve significantly over the previous state-of-the-art bounds, by a factor that can scale linearly with the dimension of the state space. In the off-policy setting, our upper bound is the first to deliver a minimax optimal scaling with respect to the tolerance level while exhibiting an explicit dependence on all the problem-related parameters.
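For readers unfamiliar with the estimator studied here, the following is a minimal illustrative sketch (not the thesis's code) of averaged TD(0) with linear function approximation under i.i.d. samples; the feature map phi, the stepsize schedule, and the sample format are assumptions made for illustration.

    import numpy as np

    def averaged_td(samples, phi, dim, gamma=0.9,
                    stepsize=lambda t: 1.0 / (t + 1) ** 0.5):
        """Return the Polyak-Ruppert average of the TD(0) iterates."""
        theta = np.zeros(dim)       # current TD iterate
        theta_bar = np.zeros(dim)   # running average of the iterates
        for t, (s, r, s_next) in enumerate(samples):
            f, f_next = phi(s), phi(s_next)
            # TD(0) semi-gradient step on the linear value estimate phi(s)^T theta
            td_error = r + gamma * f_next @ theta - f @ theta
            theta = theta + stepsize(t) * td_error * f
            theta_bar = theta_bar + (theta - theta_bar) / (t + 1)
        return theta_bar

Here samples is any sequence of (state, reward, next_state) transitions drawn i.i.d. from the stationary distribution, and the averaged iterate theta_bar is the quantity whose estimation error the non-asymptotic bounds control.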

In the second part of the thesis, we focus on the on-policy setting and develop valid and efficient inference procedures for TD learning-based estimators. We leverage novel, finite-sample distributional approximations of TD estimators under different choices of stepsizes and with both i.i.d. and Markov samples. We establish state-of-the-art Berry-Esseen bounds that control the rate at which the TD estimation errors converge to their asymptotic distributions, and formulate an online algorithm to construct confidence intervals based on these results. We demonstrate the validity of the confidence intervals for both independent samples and Markovian trajectories.
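To convey the flavor of such an interval, the sketch below builds a plug-in (sandwich) confidence interval for a single state's value from the asymptotic normality of the averaged TD estimator; the function name, the plug-in covariance estimate, and the i.i.d. sampling assumption are illustrative choices and do not reproduce the thesis's actual online procedure.

    import numpy as np
    from scipy import stats

    def td_value_ci(samples, phi, dim, s0, gamma=0.9, alpha=0.05,
                    stepsize=lambda t: 1.0 / (t + 1) ** 0.5):
        """Averaged TD(0) estimate of phi(s0)^T theta* with a plug-in normal CI."""
        theta = np.zeros(dim)
        theta_bar = np.zeros(dim)
        A_hat = np.zeros((dim, dim))      # estimate of A = E[phi (phi - gamma phi')^T]
        Gamma_hat = np.zeros((dim, dim))  # estimate of the TD-update noise covariance
        for t, (s, r, s_next) in enumerate(samples):
            f, f_next = phi(s), phi(s_next)
            g = (r + gamma * f_next @ theta - f @ theta) * f   # TD update direction
            theta = theta + stepsize(t) * g
            theta_bar = theta_bar + (theta - theta_bar) / (t + 1)
            A_hat += (np.outer(f, f - gamma * f_next) - A_hat) / (t + 1)
            Gamma_hat += (np.outer(g, g) - Gamma_hat) / (t + 1)  # plug-in at current iterate
        n = len(samples)
        A_inv = np.linalg.inv(A_hat)
        Sigma = A_inv @ Gamma_hat @ A_inv.T / n   # sandwich covariance of theta_bar
        v_hat = phi(s0) @ theta_bar
        half_width = stats.norm.ppf(1 - alpha / 2) * np.sqrt(phi(s0) @ Sigma @ phi(s0))
        return v_hat, (v_hat - half_width, v_hat + half_width)

All running estimates are updated online from the same stream of transitions, so the interval can be reported after any number of samples without storing the trajectory; handling Markovian samples and general stepsize schedules requires the finite-sample analysis developed in the thesis.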

History

Date

2024-08-01

Degree Type

  • Dissertation

Department

  • Statistics and Data Science

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Alessandro Rinaldo, Yuting Wei
