Carnegie Mellon University
boy1_Tepper_2022.pdf (1.22 MB)
Download file

A Pathwise Optimization Approach for Reinforcement Learning in Merchant Energy Operations

Download (1.22 MB)
posted on 2022-07-12, 20:46 authored by Bo YangBo Yang

This thesis studies the management of energy conversion assets. Examples include oil and biorefineries, ethanol production plants, transportation pipelines, and natural gas processing and storage facilities. The merchant operations approach formulates the management of these assets as real option models, which provide a convenient way to maximize the market value of the conversion assets. However, the real option approach gives rise to an intractable Markov decision process (MDP) due to intertemporal linkages between decisions and the high dimensionality of the market state variables when using realistic price models. We consider reinforcement learning (RL) approaches that rely on basis-function-based value function approximations to compute a feasible policy and a lower bound on the MDP optimal policy value, as well as a dual (upper) bound on the latter quantity. In particular, we focus on pathwise optimization (PO), because, in theory, it generates the tightest dual bound for a given set of basis functions. 

We first extend PO from optimal stopping to merchant energy production. It is known that applying least squares Monte Carlo (LSM), a state-of-the-art RL approach for financial and real option valuation, in conjunction with information relaxation and duality techniques to realistic merchant ethanol production instances results in sizable (about 13%) average optimality gaps. Our research aims at reducing these gaps by extending the PO applicability from optimal stopping to merchant energy production. This extension rests on formulating a novel PO linear program (PLP), which is difficult to solve optimally because it is both ill conditioned and large scale. We develop an effective preconditioning approach based on principal component analysis (PCA). We mitigate the dimensionality concern by proposing a provably convergent block coordinate descent (BCD) technique. On the known set of merchant ethanol production instances, PO leads to considerably smaller (roughly 7%) average optimality gaps compared to LSM but entails a noticeably larger average computational effort (eleven hours rather than one hour). The optimality gap reduction is almost entirely attributable to the stronger PO-based dual bounds relative to the LSM-based ones. We thus bring to light the near optimality of the LSM-based policy. 

Next, we put forth a constraint generation method to solve PLP. The proposed method iterates between a master and a subproblem. The master problem combines subsets of iii the constraints and variables of the model to obtain a relaxed linear program that an off-the-shelf solver can handle. The subproblem strengthens this relaxation by efficiently identifying constraint violations in the original linear program. This method provably converges. It can be stopped once the current solution is sufficiently close to the optimal one, which we can check using a bound computed by solving the subproblem, or when it leads to good-enough bounds on the MDP optimal policy value. We applied our approach to the benchmark merchant ethanol production instances used in the previous chapter. The numerical results suggest that the proposed method generates near optimal lower and dual bounds comparable to the BCD approach with less memory but longer CPU time. 

Finally, we extend PO to merchant energy operations applications beyond production, where the PO models considered previously may lead to unbounded PLPs. Specifically, unboundedness occurs in these applications if no feasible decision can terminate the MDP at each controllable state. In other words, these MDPs lack an analogue of the stop decision in optimal stopping. We propose a novel pseudo action scheme to deal with this issue. The pseudo action scheme bounds the PLP by adding artificial actions from nonanticipative policies, i.e., policies that do not utilize future information. BCD and constraint generation can be applied to solve this bounded PLP. We also develop a coordinated decomposition and regression approach to reduce the computational complexity of tackling PLPs. Our approach (i) solves the PLP dual via coordinated decomposition (alternating direction method of multipliers) and (ii) recovers an associated primal solution by approximately enforcing complementary slackness via two norm regression. Compared to BCD and constraint generation, coordinated decomposition features lower per iteration computational complexity because it exploits the problem’s structure. We conduct numerical studies on realistic merchant energy storage and production instances. For energy storage, obtaining a bounded PLP model requires using pseudo actions. Gurobi directly solves this PLP. The resulting lower and dual bounds are near optimal and essentially match the ones we obtain by applying LSM, which is known to exhibit excellent performance in this application. For energy production, PLP is bounded without pseudo actions but Gurobi cannot directly handle it. In contrast, on existing instances coordinated decomposition solves it with substantially less computational effort (85% reduction in memory and 50% reduction in CPU time) and generates similar bounds compared to both BCD and constraint generation. Further, it achieves near optimal performance on larger instances that are out of reach for these methods. 




Degree Type

  • Dissertation


  • Tepper School of Business

Degree Name

  • Doctor of Philosophy (PhD)


Nicola Secomandi; Selvaprabu Nadarajah

Usage metrics