Counterfactual MDPs: Planning Beyond Direct Control

Silva, Rui

doi:10.1184/R1/14428145.v1

CMU-CS-20-122.pdf (4.05 MB)

Counterfactual MDPs: Planning Beyond Direct Control

thesis

posted on 2021-04-21, 21:15 authored by Rui Silva

Planning under uncertainty using Markov decision processes (MDPs) requires a model of the environment specifying the probabilistic effects of the actions that

the agent is able to execute. The optimal policy of the agent is then computed from such model. As such, when planning, the agent assumes the environment may only change as the direct result of its actions. In this thesis we consider lifting that assumption, and allow the agent to reason over the counterfactual “What if the world were different?” Effectively, we allow the agent to reason over other possible configurations of the world, where more rewarding optimal policies may exist, and over the cost required in shifting the original environment to these modified worlds. Our

goal is to endow the agent with the ability to plan over the possibility of actually operating in such configurations of the world, if beneficial. We introduce Counterfactual MDPs, a new class of MDPs that allows the agent to reason and plan over the aforementioned counterfactual. Solving a Counterfactual MDP consists in the maximization

of the expected value/cost trade-off over possible changes to the world. In the context of MDPs, the dynamics of the world are described in terms of transition probabilities. Our approach is thus to formulate the problem as a joint optimization of the transition probabilities and optimal policy of the MDP. We analyze the complexity of the resulting problem, and formally prove it is NP-Hard. We then derive two gradient-based approaches for solving the problem. These approaches culminate in the contribution of an iterative gradient based algorithm, PITERATION,

for solving Counterfactual MDPs. Additionally, we discuss methods for scaling up this algorithm to larger problems.

We demonstrate the applicability of Counterfactual MDPs and the performance of the algorithms proposed in multiple scenarios. We show, in particular, significant

performance improvements that arise from allowing the agent to reason and plan over other possible worlds, and corresponding optimal policies. In the process we realize, however, that Counterfactual MDPs implicitly assume

that the specific world configuration the agent envisioned will be necessarily materialized. However, in many real-life scenarios there exists an underlying uncertainty in

the outcome of applying changes to the world. We extend the Counterfactual MDP model to allow the agent to reason over this uncertainty, and dub the resulting model

Stochastic Outcomes Counterfactual MDPs. This new model assumes the uncertainty associated with changes to the world follows a probabilistic distribution with parameters the agent can reason over and control, resulting in a new optimization problem. We show the gradient of this new optimization problem can be computed by solving an expectation, and thus propose a sampling based method for computing it. This allows us to extend P-ITERATION to this new class of problems with stochastic outcomes.

In the end we demonstrate the applicability of the new model in multiple scenarios with uncertainty in the outcome of changes to the world. We show that by reasoning over this uncertainty the agent is able to find more valuable world configurations.

History

Date

2020-08-01

Degree Type

Dissertation

Department

Computer Science

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Manuela Veloso Francisco S. Melo

Usage metrics

Keywords

Counterfactual Markov Decision Processes

Licence

In Copyright

Exports

RefWorks

BibTeX

Ref. manager

Endnote

DataCite

NLM

DC

Counterfactual MDPs: Planning Beyond Direct Control

History

Date

Degree Type

Department

Degree Name

Advisor(s)

Usage metrics

Categories

Keywords

Licence

Exports