Posted on 2004-01-01. Authored by J. Andrew Bagnell, Sham Kakade, Andrew Y. Ng, Jeff Schneider.
We consider the policy search approach to reinforcement learning. We
show that if a “baseline distribution” is given (indicating roughly how
often we expect a good policy to visit each state), then we can derive
a policy search algorithm that terminates in a finite number of steps,
and for which we can provide non-trivial performance guarantees. We
also demonstrate this algorithm on several grid-world POMDPs, a planar
biped walking robot, and a double-pole balancing problem.
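As a rough illustration of the idea only (not the paper's actual algorithm or its guarantees), the sketch below shows one way a per-timestep baseline distribution over states can guide a finite-horizon, backwards-in-time policy search in a small tabular MDP. All names, the sampling access to the model, and the Monte-Carlo value estimates are assumptions introduced for this example.

```python
import numpy as np

def baseline_guided_search(n_states, n_actions, T, sample_next_state,
                           reward, mu, n_samples=50, rng=None):
    """Illustrative sketch: choose a policy for each timestep, working
    backwards from T-1, greedily on states weighted by the baseline mu[t].

    sample_next_state(s, a, rng) -> next state (assumed simulator access)
    reward(s, a) -> float
    mu[t][s] -> baseline probability of being in state s at time t
    Returns pi, a list of length T of state->action arrays.
    """
    rng = rng or np.random.default_rng(0)
    pi = [np.zeros(n_states, dtype=int) for _ in range(T)]

    def value_from(s, t):
        # Monte-Carlo value of following the already-fixed pi[t], ..., pi[T-1].
        if t >= T:
            return 0.0
        total = 0.0
        for _ in range(n_samples):
            state, ret = s, 0.0
            for k in range(t, T):
                a = pi[k][state]
                ret += reward(state, a)
                state = sample_next_state(state, a, rng)
            total += ret
        return total / n_samples

    # Backwards in time: later policies are fixed, so a finite number of
    # per-timestep greedy steps suffices.
    for t in reversed(range(T)):
        for s in range(n_states):
            if mu[t][s] == 0.0:
                continue  # states the baseline never visits are ignored
            q = np.empty(n_actions)
            for a in range(n_actions):
                samples = [reward(s, a) +
                           value_from(sample_next_state(s, a, rng), t + 1)
                           for _ in range(n_samples)]
                q[a] = np.mean(samples)
            pi[t][s] = int(np.argmax(q))
    return pi
```

In this toy version the baseline only determines which states each timestep's policy is fit on; the quality of the resulting policy therefore depends on how well the baseline reflects where a good policy actually spends its time, which is the intuition the abstract points to.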