cdann_MachineLearning_2019.pdf (3.36 MB)
Download file

Strategic Exploration in Reinforcement Learning - New Algorithms and Learning Guarantees

Download (3.36 MB)
posted on 24.02.2020, 18:04 authored by Christoph DannChristoph Dann
Reinforcement learning (RL) focuses on an essential aspect of intelligent behavior – how an agent can learn to make good decisions given experience and rewards in a stochastic
world. Yet popular RL algorithms that have enabled exciting successes in domains with good simulators (Go, Atari, etc) still often fail to learn in other domains because they rely on
simple heuristics for exploration. This provides additional empirical justification for essential questions around RL, specifically around algorithms that learn in a provably efficient manner through strategic exploration in any considered domain. This thesis provides new algorithms
and theory that enable good performance with respect to existing theoretical frameworks for evaluating RL algorithms (specifically, probably approximately correct) and introduces new stronger evaluation criteria, that may be particularly of interest as RL is applied to more real world problems.
For the first line of work on probably approximately correct (PAC) RL algorithms, we introduce a series of algorithms for episodic tabular domains with substantially better PAC
sample complexity bounds that culminate in a new algorithm with close to minimax optimal PAC and regret bounds. Look up tables are required by most sample efficient and computationally tractable algorithms, but cannot represent many practical domains. We therefore also present a new RL algorithm that can learn a good policy in environments with high dimensional observations and hidden deterministic states; unlike predecessors, this algorithm provably
explores not only in a statistically but also computationally efficient manner assuming access to function classes with efficient optimization oracles. To make progress it is critical to have the right measures of success. While empirical
demonstrations are quite clear, we find that for theoretical properties, two of the most commonly used learning frameworks, PAC guarantees and regret guarantees, each allow undesirable algorithm behavior (e.g. ignoring new observations that could improve the policy). We present
a new stronger learning framework called Uniform-PAC that unifies the existing frameworks and prevents undesirable algorithm properties. One caveat of all existing learning frameworks is that for any particular episode, we do not
know how well the algorithm will perform. To address this, we introduce the IPOC framework that requires algorithms to provide a certificate before each episode bounding how suboptimal the current policy can be. Such certifications may be of substantial interest in high stakes scenarios when an organization may wish to track or even pause an online RL system should the potential expected performance bound drop below a required expected outcome.




Degree Type



Machine Learning

Degree Name

  • Doctor of Philosophy (PhD)


Emma Brunskill