Counterfactual MDPs: Planning Beyond Direct Control
Thesis posted on 21.04.2021, 21:15 by Rui Silva
Planning under uncertainty using Markov decision processes (MDPs) requires a model of the environment specifying the probabilistic effects of the actions that the agent is able to execute. The optimal policy of the agent is then computed from such a model. When planning, the agent therefore assumes the environment may change only as the direct result of its actions. In this thesis we consider lifting that assumption, and allow the agent to reason over the counterfactual “What if the world were different?” Effectively, we allow the agent to reason over other possible configurations of the world, where more rewarding optimal policies may exist, and over the cost of shifting the original environment to these modified worlds. Our goal is to endow the agent with the ability to plan over the possibility of actually operating in such configurations of the world, if beneficial. We introduce Counterfactual MDPs, a new class of MDPs that allows the agent to reason and plan over the aforementioned counterfactual. Solving a Counterfactual MDP consists of maximizing the expected value/cost trade-off over possible changes to the world. In the context of MDPs, the dynamics of the world are described in terms of transition probabilities. Our approach is thus to formulate the problem as a joint optimization of the transition probabilities and optimal policy of the MDP. We analyze the complexity of the resulting problem, and formally prove it is NP-hard. We then derive two gradient-based approaches for solving the problem. These approaches culminate in the contribution of an iterative gradient-based algorithm, P-ITERATION, for solving Counterfactual MDPs. Additionally, we discuss methods for scaling up this algorithm to larger problems.
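The value/cost trade-off described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the two-state world, the single adjustable transition probability theta, the linear cost of changing it (COST_WEIGHT), and the names world and objective. It also does not implement P-ITERATION, whose details the abstract does not give; it simply evaluates the trade-off on a grid of candidate worlds by brute force.

```python
GAMMA = 0.9  # discount factor (assumed)


def optimal_value(P, R, gamma=GAMMA, tol=1e-10):
    """Value iteration. P[a][s][t] is a transition probability, R[s] a state reward."""
    n = len(R)
    V = [0.0] * n
    while True:
        new_V = [
            R[s] + gamma * max(
                sum(P[a][s][t] * V[t] for t in range(n)) for a in range(len(P))
            )
            for s in range(n)
        ]
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return new_V
        V = new_V


def world(theta):
    """Hypothetical world: 'move' takes state 0 to the goal state 1 w.p. theta."""
    stay = [[1.0, 0.0], [0.0, 1.0]]
    move = [[1.0 - theta, theta], [0.0, 1.0]]
    return [stay, move]


R = [0.0, 1.0]      # reward is collected only in the goal state
THETA0 = 0.2        # original world: 'move' rarely succeeds
COST_WEIGHT = 2.0   # assumed linear cost of shifting theta away from THETA0


def objective(theta):
    """Optimal value at the start state minus the cost of changing the world."""
    value_at_start = optimal_value(world(theta), R)[0]
    return value_at_start - COST_WEIGHT * abs(theta - THETA0)


# Brute-force search over candidate worlds (not the thesis's algorithm).
candidates = [i / 100 for i in range(101)]
best_theta = max(candidates, key=objective)

print("original world :", THETA0, objective(THETA0))
print("best candidate :", best_theta, objective(best_theta))
```

In this toy setting the best candidate lies strictly between the original world and the most favorable one, because each further improvement to theta yields a smaller gain in value while the cost keeps growing linearly.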
We demonstrate the applicability of Counterfactual MDPs and the performance of the proposed algorithms in multiple scenarios. We show, in particular, significant performance improvements that arise from allowing the agent to reason and plan over other possible worlds, and corresponding optimal policies.
In the process we realize that Counterfactual MDPs implicitly assume that the specific world configuration the agent envisioned will necessarily materialize. However, in many real-life scenarios there exists an underlying uncertainty in the outcome of applying changes to the world. We extend the Counterfactual MDP model to allow the agent to reason over this uncertainty, and dub the resulting model Stochastic Outcomes Counterfactual MDPs. This new model assumes the uncertainty associated with changes to the world follows a probability distribution with parameters the agent can reason over and control, resulting in a new optimization problem. We show the gradient of this new optimization problem can be computed by solving an expectation, and thus propose a sampling-based method for computing it. This allows us to extend P-ITERATION to this new class of problems with stochastic outcomes.
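To give a feel for how a gradient expressed as an expectation can be estimated by sampling, here is a minimal sketch using the standard score-function (likelihood-ratio) estimator. This is one common technique and is not claimed to be the estimator used in the thesis. The Gaussian outcome distribution, the quadratic outcome_value standing in for the value/cost trade-off of a realized world, and all names are assumptions made for illustration.

```python
import random


def outcome_value(theta):
    # Hypothetical stand-in for the value/cost trade-off of a realized world.
    return -(theta - 3.0) ** 2


def sampled_gradient(mu, sigma, n_samples, rng):
    """Estimate d/dmu of E[outcome_value(theta)] with theta ~ N(mu, sigma^2).

    Uses the identity grad E[f(theta)] = E[f(theta) * grad log p(theta)].
    """
    total = 0.0
    for _ in range(n_samples):
        theta = rng.gauss(mu, sigma)           # sample a realized outcome
        score = (theta - mu) / sigma ** 2      # d/dmu log N(theta; mu, sigma^2)
        total += outcome_value(theta) * score
    return total / n_samples


rng = random.Random(0)  # fixed seed so the run is repeatable
estimate = sampled_gradient(mu=1.0, sigma=0.5, n_samples=200_000, rng=rng)

# For this quadratic stand-in the gradient has the closed form -2 * (mu - 3).
exact = -2.0 * (1.0 - 3.0)

print("sampled gradient:", estimate)
print("exact gradient  :", exact)
```

The estimator only needs samples of the outcome and the derivative of the log-density with respect to the controllable parameter, which is why it applies even when the value of a realized world is itself the result of solving an MDP. Its variance can be high, so large sample counts or a baseline term are typically needed in practice.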
Finally, we demonstrate the applicability of the new model in multiple scenarios with uncertainty in the outcome of changes to the world. We show that, by reasoning over this uncertainty, the agent is able to find more valuable world configurations.
- Doctor of Philosophy (PhD)