policy would be short-sighted. If γ is close to 1, the agent puts more emphasis on rewards in the far future. In this case, the resulting policy dares to take the risk of receiving negative rewards in the near future in exchange for greater rewards later. These points are demonstrated later by the examples in Section 3.5 of Chapter 3.
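To make the effect of γ concrete, the following short Python sketch compares the discounted returns of two hypothetical reward sequences; the sequences and the γ values are illustrative assumptions, not examples from this book. One sequence accepts small negative rewards early in order to reach a large reward later, and the other avoids the penalties but never obtains the large reward.

# Illustrative sketch: effect of the discount rate gamma on the discounted return.
# The two reward sequences below are hypothetical, not taken from the book.

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

risky = [-1, -1, 0, 0, 10]   # accepts near-term penalties to reach a large reward
safe  = [0, 0, 0, 0, 0]      # avoids penalties but never reaches the large reward

for gamma in (0.1, 0.9):
    print(gamma, discounted_return(risky, gamma), discounted_return(safe, gamma))

# With gamma = 0.1 the risky sequence is worse (about -1.10 vs 0),
# while with gamma = 0.9 it is better (about 4.66 vs 0).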
One important notion that was not explicitly mentioned in the above discussion is the episode. When interacting with the environment by following a policy, the agent may stop at some terminal states. The resulting trajectory is called an episode (or a trial). If the environment or the policy is stochastic, we may obtain different episodes starting from the same state. However, if everything is deterministic, we always obtain the same episode starting from the same state.
An episode is usually assumed to be a finite trajectory. Tasks with episodes are called
episodic tasks. However, some tasks may have no terminal states, meaning the interaction
with the environment will never end. Such tasks are called continuing tasks. In fact, we
can treat episodic and continuing tasks in a unified mathematical way by converting
episodic tasks into continuing tasks. The key is to properly define what happens after the target/terminal state is reached. Specifically, after reaching the target or terminal state in an episodic task, the agent can continue taking actions. The target/terminal state can be treated in two ways.
First, if we treat it as a special state, we can specially design its action space or state
transition such that the agent stays at this state forever. Such states are called absorbing
states, meaning that the agent would never leave the state once it is reached. For example,
for the target state s_9, we can specify A(s_9) = {a_5}, or set A(s_9) = {a_1, ..., a_5} but p(s_9 | s_9, a_i) = 1 for all i = 1, ..., 5. We can also set the reward obtained after reaching s_9 as always zero.
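As a sketch of this first treatment, the following Python fragment makes the target state absorbing: whatever action is taken at s_9, the next state is s_9 and the reward is zero. The state and action names follow the grid-world example used earlier in the chapter; the dictionary-based representation is an illustrative assumption, not the book's implementation.

# Sketch: an absorbing target state (first treatment).
# State/action names follow the grid-world example; the data structure is illustrative.

ACTIONS = ["a1", "a2", "a3", "a4", "a5"]

# transition[(s, a)] = next state, reward[(s, a)] = immediate reward
transition = {}
reward = {}

# Make s9 absorbing: every action keeps the agent at s9 with zero reward.
for a in ACTIONS:
    transition[("s9", a)] = "s9"   # p(s9 | s9, a_i) = 1 for all actions
    reward[("s9", a)] = 0          # the reward after reaching s9 is always zero

# Under this convention, an episode can be extended indefinitely after reaching s9
# without changing its return, so the episodic task becomes a continuing one.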
Second, if we treat the target state as a normal state, we can simply set its action space to be the same as that of the other states, so the agent may leave the state. Since a positive reward of r = 1 can
be obtained every time s_9 is reached, the agent will eventually learn to stay at s_9 forever to collect more rewards. Of course, when the episode is infinitely long and the reward of staying at s_9 is positive, a discount rate must be used to calculate the discounted return to avoid divergence. In this book, we consider the second scenario, where the target state is treated as a normal state whose action space is A(s_9) = {a_1, ..., a_5}.
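To see why the discount rate avoids divergence in this second scenario, consider a trajectory that stays at s_9 forever and collects r = 1 at every step. A short worked calculation with the discounted return gives

G = \sum_{t=0}^{\infty} \gamma^{t} \cdot 1 = \frac{1}{1-\gamma},

which is finite for any γ ∈ [0, 1) but diverges when γ = 1 (no discounting).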
1.7 Markov decision process
The previous sections of this chapter illustrated some fundamental concepts of RL through examples. This section presents these concepts in a more formal way under the framework of Markov decision processes (MDPs).
An MDP is a general framework for describing stochastic dynamical systems. The key ingredients of an MDP are listed below.
– Sets: