policy would be short-sighted. If γ is close to 1, the agent puts more emphasis on rewards in the far future. In this case, the resulting policy dares to take the risk of receiving negative rewards in the near future in exchange for greater rewards later. These points are demonstrated later by the examples in Section 3.5 of Chapter 3.
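To make the effect of γ concrete, the following short Python sketch compares the discounted returns of two hypothetical reward sequences; the sequences and the γ values are illustrative assumptions, not examples from this book. One sequence accepts small negative rewards early in order to reach a large reward later, and the other avoids the penalties but never obtains the large reward.

# Illustrative sketch: effect of the discount rate gamma on the discounted return.
# The two reward sequences below are hypothetical, not taken from the book.

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

risky = [-1, -1, 0, 0, 10]   # accepts near-term penalties to reach a large reward
safe  = [0, 0, 0, 0, 0]      # avoids penalties but never reaches the large reward

for gamma in (0.1, 0.9):
    print(gamma, discounted_return(risky, gamma), discounted_return(safe, gamma))

# With gamma = 0.1 the risky sequence is worse (about -1.10 vs 0),
# while with gamma = 0.9 it is better (about 4.66 vs 0).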
One important notion that was not explicitly mentioned in the above discussion is the episode. When interacting with the environment by following a policy, the agent may stop at some terminal states. The resulting trajectory is called an episode (or a trial). If the environment or the policy is stochastic, we may obtain different episodes starting from the same state. However, if everything is deterministic, we always obtain the same episode starting from the same state.
An episode is usually assumed to be a finite trajectory. Tasks with episodes are called
episodic tasks. However, some tasks may have no terminal states, meaning the interaction
with the environment will never end. Such tasks are called continuing tasks. In fact, we
can treat episodic and continuing tasks in a unified mathematical way by converting
episodic tasks into continuing tasks. The key is to properly define what happens after the target/terminal state is reached. Specifically, after reaching the target or terminal state in an episodic task, the agent can continue taking actions. The target/terminal state can be treated in two ways.
First, if we treat it as a special state, we can specially design its action space or state
transition such that the agent stays at this state forever. Such states are called absorbing
states, meaning that the agent would never leave the state once it is reached. For example,
for the target state s_9, we can specify A(s_9) = {a_5}, or set A(s_9) = {a_1, ..., a_5} but p(s_9 | s_9, a_i) = 1 for all i = 1, ..., 5. We can also set the reward obtained after reaching s_9 as always zero.
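As a sketch of this first treatment, the following Python fragment makes the target state absorbing: whatever action is taken at s_9, the next state is s_9 and the reward is zero. The state and action names follow the grid-world example used earlier in the chapter; the dictionary-based representation is an illustrative assumption, not the book's implementation.

# Sketch: an absorbing target state (first treatment).
# State/action names follow the grid-world example; the data structure is illustrative.

ACTIONS = ["a1", "a2", "a3", "a4", "a5"]

# transition[(s, a)] = next state, reward[(s, a)] = immediate reward
transition = {}
reward = {}

# Make s9 absorbing: every action keeps the agent at s9 with zero reward.
for a in ACTIONS:
    transition[("s9", a)] = "s9"   # p(s9 | s9, a_i) = 1 for all actions
    reward[("s9", a)] = 0          # the reward after reaching s9 is always zero

# Under this convention, an episode can be extended indefinitely after reaching s9
# without changing its return, so the episodic task becomes a continuing one.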
Second, if we treat the target state as a normal state, we can simply set its action space to be the same as that of the other states, so the agent may leave the state. Since a positive reward of r = 1 can
be obtained every time s_9 is reached, the agent will eventually learn to stay at s_9 forever to collect more rewards. Of course, when the episode is infinitely long and the reward of staying at s_9 is positive, a discount rate must be used to calculate the discounted return to avoid divergence. In this book, we consider the second scenario, where the target state is treated as a normal state whose action space is A(s_9) = {a_1, ..., a_5}.
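To see why the discount rate avoids divergence in this second scenario, consider a trajectory that stays at s_9 forever and collects r = 1 at every step. A short worked calculation with the discounted return gives

G = \sum_{t=0}^{\infty} \gamma^{t} \cdot 1 = \frac{1}{1-\gamma},

which is finite for any γ ∈ [0, 1) but diverges when γ = 1 (no discounting).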
1.7 Markov decision process
The previous sections of this chapter illustrated some fundamental concepts of RL through examples. This section presents these concepts in a more formal way under the framework of Markov decision processes (MDPs).
An MDP is a general framework for describing stochastic dynamical systems. The key ingredients of an MDP are listed below.
– Sets: