Posted on 1999-06-01 by Geoffrey J. Gordon.
One of the basic problems of machine learning is deciding how to act in an
uncertain world. For example, if I want my robot to bring me a cup of coffee, it
must be able to compute the correct sequence of electrical impulses to send to
its motors to navigate from the coffee pot to my
office. In fact, since the results
of its actions are not completely predictable, it is not enough just to compute
the correct sequence; instead the robot must sense and correct for deviations
from its intended path.
For a machine learner to act reasonably in an uncertain environment, it must
solve problems like the one above quickly and reliably. Unfortunately, the
world is often so complicated that it is difficult or impossible to find the
optimal sequence of actions for achieving a given goal. So, to scale our
learners up to real-world problems, we usually must settle for approximate
solutions.
One representation for a learner's environment and goals is a Markov decision
process, or MDP. MDPs allow us to represent actions that have probabilistic
outcomes, and to plan for complicated, temporally extended goals. An MDP
consists of a set of states that the environment can be in, together with rules
for how the environment can change state and for what the learner is supposed
to do.
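To make this concrete, here is a minimal sketch, not drawn from the thesis itself, of one way such a model might be written down in Python. The state names, actions, probabilities, and rewards are all invented for illustration, loosely following the coffee-delivery example above.

```python
# A toy MDP, written as a plain dictionary (illustrative only).
# transition[state][action] is a list of (probability, next_state, reward) tuples.
MDP = {
    "states": ["at_coffee_pot", "in_hallway", "at_office"],
    "actions": ["move_forward", "stay"],
    "transition": {
        "at_coffee_pot": {
            "move_forward": [(0.9, "in_hallway", 0.0), (0.1, "at_coffee_pot", 0.0)],
            "stay":         [(1.0, "at_coffee_pot", 0.0)],
        },
        "in_hallway": {
            "move_forward": [(0.8, "at_office", 1.0), (0.2, "in_hallway", 0.0)],
            "stay":         [(1.0, "in_hallway", 0.0)],
        },
        "at_office": {  # goal state: absorbing, no further reward
            "move_forward": [(1.0, "at_office", 0.0)],
            "stay":         [(1.0, "at_office", 0.0)],
        },
    },
    "discount": 0.95,
}
```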
One way to approach a large MDP is to compute an approximation
to its optimal state evaluation function, the function that tells us how much
reward the learner can expect to achieve if the world is in a particular
state. If the approximation is good enough, we can use a shallow search to find
a good action from most states. Researchers have tried many different ways
to approximate evaluation functions. This thesis aims for a middle ground
between algorithms that don't scale well because they use an impoverished representation
for the evaluation function, and algorithms that we can't analyze
because they use too complicated a representation.
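As a rough illustration of the role an evaluation function plays, and not of the thesis's own algorithm, the sketch below computes a tabular evaluation function for the toy MDP dictionary above by value iteration, and then uses a one-step lookahead, the simplest kind of shallow search, to pick an action.

```python
def backup(mdp, V, s, a):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + mdp["discount"] * V[s2])
               for p, s2, r in mdp["transition"][s][a])

def value_iteration(mdp, sweeps=200):
    """Compute a tabular evaluation function by repeated Bellman backups."""
    V = {s: 0.0 for s in mdp["states"]}
    for _ in range(sweeps):
        V = {s: max(backup(mdp, V, s, a) for a in mdp["actions"])
             for s in mdp["states"]}
    return V

def greedy_action(mdp, V, s):
    """One-step lookahead: the shallow search that turns an evaluation function into behavior."""
    return max(mdp["actions"], key=lambda a: backup(mdp, V, s, a))

V = value_iteration(MDP)
print(greedy_action(MDP, V, "in_hallway"))   # prints "move_forward"
```

In a realistic problem the state space is far too large for such a table, which is why the evaluation function must be approximated; the lookahead step that converts the evaluation function into actions stays the same.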