formalization must wait until Chapter 3, but the basic idea is simply to capture the most important
aspects of the real problem facing a learning agent interacting over time with its environment to achieve
a goal. A learning agent must be able to sense the state of its environment to some extent and must be
able to take actions that affect the state. The agent also must have a goal or goals relating to the state of
the environment. Markov decision processes are intended to include just these three aspects—sensation,
action, and goal—in their simplest possible forms without trivializing any of them. Any method that
is well suited to solving such problems we consider to be a reinforcement learning method.
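To make the agent-environment interaction concrete before the formal treatment in Chapter 3, the following Python sketch shows the sensation-action-goal loop in its barest form. The Environment and Agent classes, the toy state dynamics, and the reward of +1 in a single goal state are illustrative assumptions for this sketch only, not part of the formal framework.

# A minimal sketch (illustrative, not from the formal framework) of the
# sensation-action-goal loop: the agent senses the state, takes actions
# that affect the state, and receives a reward signal encoding its goal.

import random


class Environment:
    """A toy environment with states 0..3; state 3 is the goal."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # The action affects the state of the environment.
        self.state = max(0, min(3, self.state + action))
        # The reward signal encodes the goal: +1 only in the goal state.
        reward = 1 if self.state == 3 else 0
        return self.state, reward


class Agent:
    """An agent that senses the state and selects an action (randomly here)."""

    def act(self, state):
        return random.choice([-1, +1])


env, agent = Environment(), Agent()
state = env.state
for t in range(10):
    action = agent.act(state)           # act on the basis of the sensed state
    state, reward = env.step(action)    # environment returns new state and reward
    print(t, state, reward)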
Reinforcement learning is different from supervised learning, the kind of learning studied in most
current research in the field of machine learning. Supervised learning is learning from a training set
of labeled examples provided by a knowledgeable external supervisor. Each example is a description of
a situation together with a specification—the label—of the correct action the system should take in
that situation, which is often to identify a category to which the situation belongs. The object of this
kind of learning is for the system to extrapolate, or generalize, its responses so that it acts correctly
in situations not present in the training set. This is an important kind of learning, but alone it is
not adequate for learning from interaction. In interactive problems it is often impractical to obtain
examples of desired behavior that are both correct and representative of all the situations in which the
agent has to act. In uncharted territory—where one would expect learning to be most beneficial—an
agent must be able to learn from its own experience.
Reinforcement learning is also different from what machine learning researchers call unsupervised
learning, which is typically about finding structure hidden in collections of unlabeled data. The terms
supervised learning and unsupervised learning would seem to exhaustively classify machine learning
paradigms, but they do not. Although one might be tempted to think of reinforcement learning as a
kind of unsupervised learning because it does not rely on examples of correct behavior, reinforcement
learning is trying to maximize a reward signal instead of trying to find hidden structure. Uncovering
structure in an agent’s experience can certainly be useful in reinforcement learning, but by itself does
not address the reinforcement learning problem of maximizing a reward signal. We therefore consider
reinforcement learning to be a third machine learning paradigm, alongside supervised learning and
unsupervised learning and perhaps other paradigms as well.
One of the challenges that arise in reinforcement learning, and not in other kinds of learning, is the
trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning
agent must prefer actions that it has tried in the past and found to be effective in producing reward.
But to discover such actions, it has to try actions that it has not selected before. The agent has to
exploit what it has already experienced in order to obtain reward, but it also has to explore in order to
make better action selections in the future. The dilemma is that neither exploration nor exploitation
can be pursued exclusively without failing at the task. The agent must try a variety of actions and
progressively favor those that appear to be best. On a stochastic task, each action must be tried many
times to gain a reliable estimate of its expected reward. The exploration–exploitation dilemma has been
intensively studied by mathematicians for many decades, yet remains unresolved. For now, we simply
note that the entire issue of balancing exploration and exploitation does not even arise in supervised
and unsupervised learning, at least in their purest forms.
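As a concrete illustration of this balancing act, the following sketch uses epsilon-greedy action selection on a small stochastic bandit task: with probability epsilon the agent explores by choosing an action at random, and otherwise it exploits its current estimates of expected reward. The arm probabilities, the value of epsilon, and the sample-average update are illustrative choices for this sketch, not a prescription.

# A minimal sketch of balancing exploration and exploitation:
# epsilon-greedy action selection on a stochastic multi-armed bandit.

import random

true_means = [0.2, 0.5, 0.8]          # success probability of each action; unknown to the agent
estimates = [0.0] * len(true_means)   # incremental estimates of each action's expected reward
counts = [0] * len(true_means)
epsilon = 0.1                         # fraction of steps spent exploring

for t in range(1000):
    if random.random() < epsilon:
        a = random.randrange(len(true_means))                         # explore: try any action
    else:
        a = max(range(len(true_means)), key=lambda i: estimates[i])   # exploit the current best estimate
    reward = 1.0 if random.random() < true_means[a] else 0.0          # stochastic reward
    counts[a] += 1
    estimates[a] += (reward - estimates[a]) / counts[a]               # sample-average update

print(estimates)  # the estimates approach the true means, and the best action is favored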
Another key feature of reinforcement learning is that it explicitly considers the whole problem of a
goal-directed agent interacting with an uncertain environment. This is in contrast to many approaches
that consider subproblems without addressing how they might fit into a larger picture. For example, we
have mentioned that much of machine learning research is concerned with supervised learning without
explicitly specifying how such an ability would finally be useful. Other researchers have developed
theories of planning with general goals, but without considering planning’s role in real-time decision
making, or the question of where the predictive models necessary for planning would come from. Although
these approaches have yielded many useful results, their focus on isolated subproblems is a
significant limitation.