posted on 2003-01-01, 00:00, authored by J. Andrew Bagnell and Jeff Schneider
Much recent work in reinforcement learning and stochastic optimal control has focused on algorithms that search directly through a space of policies rather than building approximate value functions. Policy search has numerous advantages: it does not rely on the Markov assumption, domain knowledge may be encoded in a policy, the policy may require less representational power than a value-function approximation, and stable and convergent algorithms are well understood. In contrast with value-function methods, however, existing approaches to policy search have heretofore focused entirely on parametric approaches, which places fundamental limits on the kinds of policies that can be represented. In this work, we show how policy search (with or without the additional guidance of value functions) in a Reproducing Kernel Hilbert Space gives a simple and rigorous extension of the technique to non-parametric settings. In particular, we investigate a new class of algorithms that generalize REINFORCE-style likelihood-ratio methods to yield both online and batch techniques that perform gradient search in a function space of policies. Further, we describe the computational tools that allow efficient implementation. Finally, we apply our new techniques to interesting reinforcement learning problems.
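
To make the likelihood-ratio idea concrete, the following is a minimal sketch, not the paper's actual algorithm, of a REINFORCE-style update for a non-parametric policy: a Gaussian policy whose mean is a function f in an RKHS, represented as a kernel expansion. The kernel choice, the KernelPolicy class, and all parameter names are illustrative assumptions. Under these assumptions the functional gradient of log pi(a|s) with respect to f is ((a - f(s)) / sigma^2) * k(s, .), so each rollout contributes new kernel centers at the visited states.

```python
import numpy as np

# Hypothetical sketch of a kernel (non-parametric) REINFORCE-style update.
# Assumptions: Gaussian policy with mean f(s), f in an RKHS with kernel k,
# represented as f(s) = sum_i alpha_i * k(center_i, s).

def rbf_kernel(x, y, bandwidth=1.0):
    # Gaussian RBF kernel; any positive-definite kernel could be substituted.
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * bandwidth ** 2))

class KernelPolicy:
    def __init__(self, sigma=0.5, bandwidth=1.0):
        self.centers = []        # states at which kernel functions are centered
        self.alphas = []         # corresponding expansion coefficients
        self.sigma = sigma       # action noise of the Gaussian policy
        self.bandwidth = bandwidth

    def mean(self, state):
        # f(state) as a kernel expansion over stored centers.
        return sum(a * rbf_kernel(c, state, self.bandwidth)
                   for c, a in zip(self.centers, self.alphas))

    def sample_action(self, state):
        return np.random.normal(self.mean(state), self.sigma)

    def episode_gradient_coeffs(self, states, actions):
        # Coefficients of the functional (likelihood-ratio) gradient of
        # log pi along one trajectory: one kernel center per visited state.
        return [(a - self.mean(s)) / self.sigma ** 2
                for s, a in zip(states, actions)]

    def reinforce_update(self, states, actions, total_return, step_size):
        # Stochastic functional-gradient ascent:
        # f <- f + step_size * total_return * grad_f log pi(trajectory)
        coeffs = self.episode_gradient_coeffs(states, actions)
        for s, c in zip(states, coeffs):
            self.centers.append(np.asarray(s))
            self.alphas.append(step_size * total_return * c)

# Example: one gradient step on a single toy rollout.
policy = KernelPolicy()
states = [np.array([0.0]), np.array([0.5]), np.array([1.0])]
actions = [policy.sample_action(s) for s in states]
policy.reinforce_update(states, actions, total_return=1.0, step_size=0.1)
```

In this kind of sketch the policy representation grows with the data rather than being fixed in advance, which is the sense in which the search is non-parametric; practical implementations would also need the computational tools the abstract alludes to (e.g., controlling the growth of the kernel expansion).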