Moscow University Survey of Deep Reinforcement Learning Algorithms: DQN, A2C and Distributional Algorithms in Detail

PDF | 5.48 MB | Updated 2024-07-16

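This resource, "ReinforcementLearningMoscow.pdf", was written by Sergey Ivanov of Lomonosov Moscow State University and introduces modern deep reinforcement learning algorithms. The author covers the basic concepts and core algorithms of reinforcement learning, including value functions and several families of algorithms: value-based methods (DQN, Double DQN, Dueling DQN, Noisy DQN, Prioritized Experience Replay and Multi-step DQN) and distributional value methods (Categorical DQN, Quantile Regression DQN (QR-DQN) and Rainbow DQN).

In the part on the reinforcement learning problem setup, the author emphasizes the basic assumptions of the field, such as the model of interaction between agent and environment, and the objective of maximizing long-term reward through an optimal policy. Value functions play a key role here: they measure the value of states or actions and support decision making.

The chapter on value-based methods examines Temporal Difference (TD) learning, a method for estimating state values or action values. Its focus is Deep Q-learning (DQN): the deep neural network it uses and how it extends traditional Q-learning beyond the tabular case. Double DQN improves learning stability by separating action selection from action evaluation, while Dueling DQN separates the value and advantage estimates. Noisy DQN injects noise into the network parameters to drive exploration, Prioritized Experience Replay improves learning efficiency by favoring the more important transitions, and Multi-step DQN uses predictions over several time steps to better estimate long-term return.

The text then turns to distributional methods. Categorical DQN represents the return as a discrete probability distribution, Quantile Regression DQN (QR-DQN) estimates quantiles of the return distribution, and Rainbow DQN combines the improvements listed above.

The policy gradient part presents the Policy Gradient Theorem, the basis for optimizing a policy directly. REINFORCE is the basic model-free policy gradient method, while Advantage Actor-Critic (A2C) combines a value function with policy updates for more efficient policy learning. The document covers material from basic concepts to current state-of-the-art algorithms and is a useful learning resource for researchers and practitioners.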

3. Value-based algorithms

3.1. Temporal Difference learning

In this section we consider the temporal difference learning algorithm [21, Chapter 6], a classical Reinforcement Learning method that underlies the modern value-based approach in DRL.

The first idea behind this algorithm is to search for the optimal Q-function $Q^*(s, a)$ by solving a system of recursive equations, which can be derived by recalling the interconnection between the Q-function and the value function (3):

$$Q^\pi(s, a) = \mathbb{E}_{s' \sim p(s' \mid s, a)} \left[ r(s') + \gamma V^\pi(s') \right] = \{\text{using (4)}\} = \mathbb{E}_{s' \sim p(s' \mid s, a)} \left[ r(s') + \gamma \mathbb{E}_{a' \sim \pi(a' \mid s')} Q^\pi(s', a') \right]$$



This equation, named the Bellman equation, remains true for value functions under any policy, including the optimal policy $\pi^*$:

$$Q^*(s, a) = \mathbb{E}_{s' \sim p(s' \mid s, a)} \left[ r(s') + \gamma \mathbb{E}_{a' \sim \pi^*(a' \mid s')} Q^*(s', a') \right] \tag{5}$$

Recalling proposition 3, the optimal (deterministic) policy can be represented as $\pi^*(s) = \operatorname{argmax}_a Q^*(s, a)$. Substituting this for $\pi^*(s)$ in (5), we obtain the fundamental Bellman optimality equation:

Proposition 5. (Bellman optimality equation)

$$Q^*(s, a) = \mathbb{E}_{s' \sim p(s' \mid s, a)} \left[ r(s') + \gamma \max_{a'} Q^*(s', a') \right] \tag{6}$$

The straightforward utilization of this result is as follows. Consider the tabular case, when both the state space $S$ and the action space $A$ are finite (and small enough to be listed in computer memory). Let us also assume for now that transition probabilities are available to the training procedure. Then $Q^*(s, a) : S \times A \to \mathbb{R}$ can be represented as a finite table with $|S||A|$ numbers. In this case (6) just gives a set of $|S||A|$ equations for this table to satisfy.

Treating the values of the table as unknown variables, this system of equations can be solved using the basic point iteration method: let $Q^*_0(s, a)$ be arbitrary initial values of the table (with the only exception that for terminal states $s$, if any, $Q^*_0(s, a) = 0$ for all actions $a$). On each iteration $t$ the table is updated by substituting its current values into the right side of the equation until the process converges:

$$Q^*_{t+1}(s, a) = \mathbb{E}_{s' \sim p(s' \mid s, a)} \left[ r(s') + \gamma \max_{a'} Q^*_t(s', a') \right] \tag{7}$$
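The point iteration in (7) can be sketched on a toy problem. The following is a minimal illustration, assuming a hypothetical 3-state MDP whose transition table `P`, rewards `REWARD` and discount `GAMMA` are invented here and do not come from the text; the function names are likewise illustrative.

```python
# Hypothetical 3-state MDP used only for illustration: states 0 and 1 are
# non-terminal, state 2 is terminal. P[s][a] lists (probability, next_state).
GAMMA = 0.9
P = {
    0: {0: [(1.0, 1)], 1: [(0.5, 0), (0.5, 2)]},
    1: {0: [(1.0, 2)], 1: [(1.0, 0)]},
}
REWARD = {0: 0.0, 1: 1.0, 2: 10.0}  # r(s'): reward depends on the next state
TERMINAL = {2}
N_STATES, N_ACTIONS = 3, 2


def bellman_update(q):
    """Apply the right-hand side of (7) to every cell of the table q."""
    new_q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for s in range(N_STATES):
        if s in TERMINAL:
            continue  # rows of terminal states stay at zero
        for a in range(N_ACTIONS):
            new_q[s][a] = sum(
                prob * (REWARD[s2] + GAMMA * max(q[s2]))
                for prob, s2 in P[s][a]
            )
    return new_q


def point_iteration(tol=1e-10, max_iters=10_000):
    """Repeat the full-table update until the largest change is below tol."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(max_iters):
        new_q = bellman_update(q)
        diff = max(abs(new_q[s][a] - q[s][a])
                   for s in range(N_STATES) for a in range(N_ACTIONS))
        q = new_q
        if diff < tol:
            break
    return q
```

Calling `point_iteration()` on this toy MDP yields a table close to `[[10.0, 9.5], [10.0, 9.0], [0.0, 0.0]]`, which can be checked by hand against (6). Note that every sweep touches all state-action pairs and uses the exact transition probabilities.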

This straightforward approach to learning the optimal Q-function, named Q-learning, has been extensively studied in classical Reinforcement Learning. One of the central results is presented in the following convergence theorem:

Proposition 6. Let $B$ denote an operator $(S \times A \to \mathbb{R}) \to (S \times A \to \mathbb{R})$ updating $Q^*_t$ as in (7):

$$Q^*_{t+1} = B Q^*_t$$

for all state-action pairs $s, a$. Then $B$ is a contraction mapping, i.e. for any two tables $Q_1, Q_2 \in (S \times A \to \mathbb{R})$

$$\| B Q_1 - B Q_2 \|_\infty \le \gamma \| Q_1 - Q_2 \|_\infty$$

Therefore, there is a unique fixed point of the system of equations (7) and the point iteration method converges to it.
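The contraction inequality can be checked numerically. The sketch below assumes a hypothetical 2-state, 2-action MDP (the table `P`, the rewards and the discount are made up for illustration) and compares both sides of the inequality on random pairs of tables; it is a sanity check, not a proof.

```python
import random

GAMMA = 0.9
# Hypothetical 2-state, 2-action MDP: P[s][a] lists (probability, next_state).
P = [
    [[(0.7, 0), (0.3, 1)], [(1.0, 1)]],
    [[(0.4, 0), (0.6, 1)], [(1.0, 0)]],
]
REWARD = [1.0, -2.0]  # r(s') for each possible next state


def bellman_operator(q):
    """Operator B: one full-table update as in (7)."""
    return [
        [sum(p * (REWARD[s2] + GAMMA * max(q[s2])) for p, s2 in P[s][a])
         for a in range(2)]
        for s in range(2)
    ]


def sup_norm_distance(q1, q2):
    """Largest absolute difference between two tables."""
    return max(abs(x - y) for r1, r2 in zip(q1, q2) for x, y in zip(r1, r2))


def check_contraction(trials=1000, seed=0):
    """Return True if ||BQ1 - BQ2|| <= gamma * ||Q1 - Q2|| on all random pairs."""
    rng = random.Random(seed)
    for _ in range(trials):
        q1 = [[rng.uniform(-10, 10) for _ in range(2)] for _ in range(2)]
        q2 = [[rng.uniform(-10, 10) for _ in range(2)] for _ in range(2)]
        lhs = sup_norm_distance(bellman_operator(q1), bellman_operator(q2))
        rhs = GAMMA * sup_norm_distance(q1, q2)
        if lhs > rhs + 1e-9:  # small slack for floating-point rounding
            return False
    return True
```

The rewards cancel in the difference $BQ_1 - BQ_2$, so only the discounted maxima remain, which is where the factor $\gamma$ comes from.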

The contraction mapping property is actually of high importance. It demonstrates that the point iteration algorithm converges with exponential speed and requires a small number of iterations. As the true $Q^*$ is a fixed point of (6), the algorithm is guaranteed to yield a correct answer. The catch is that each iteration demands a full pass across all state-action pairs and exact computation of expectations over transition probabilities.

In the general case, these expectations can't be explicitly computed. Instead, the agent is restricted to samples from transition probabilities gained during some interaction experience. The Temporal Difference (TD) algorithm proposes to collect this data using $\pi_t = \operatorname{argmax}_a Q^*_t(s, a) \approx \pi^*$ and, after each gathered transition $(s_t, a_t, r_{t+1}, s_{t+1})$, to update only one cell of the table:

$$Q^*_{t+1}(s, a) = \begin{cases} (1 - \alpha_t) Q^*_t(s, a) + \alpha_t \left[ r_{t+1} + \gamma \max_{a'} Q^*_t(s_{t+1}, a') \right] & \text{if } s = s_t, a = a_t \\ Q^*_t(s, a) & \text{else} \end{cases} \tag{8}$$

where $\alpha_t \in (0, 1)$ plays the role of an exponential smoothing parameter for estimating the expectation $\mathbb{E}_{s' \sim p(s' \mid s, a)}(\cdot)$ from samples.

Two key ideas are introduced in the update formula (8): exponential smoothing instead of exact expectation computation, and cell-by-cell updates instead of updating the full table at once. Both are required to adapt the Q-learning algorithm to online application.

As the set of terminal states in the online setting is usually unknown beforehand, a slight modification of update (8) is used. If the observed next state $s'$ turns out to be terminal (recall the convention to denote this by the flag done), its value function is known to be equal to zero:

$$V^*(s') = \max_{a'} Q^*(s', a') = 0$$

This knowledge is embedded in the update rule (8) by multiplying $\max_{a'} Q^*_t(s_{t+1}, a')$ by $(1 - \text{done}_{t+1})$. For the sake of brevity, this factor is often omitted but should always be present in implementations.

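A second important note about formula (8) is that it can be rewritten in the following equivalent way:

$$Q^*_{t+1}(s, a) = \begin{cases} Q^*_t(s, a) + \alpha_t \left[ r_{t+1} + \gamma \max_{a'} Q^*_t(s_{t+1}, a') - Q^*_t(s, a) \right] & \text{if } s = s_t, a = a_t \\ Q^*_t(s, a) & \text{else} \end{cases} \tag{9}$$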

The expression in the brackets, referred to as the temporal difference, represents the difference between the Q-value $Q^*_t(s, a)$ and its one-step approximation $r_{t+1} + \gamma \max_{a'} Q^*_t(s_{t+1}, a')$, which must be zero in expectation for the true optimal Q-function.
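The equivalence of forms (8) and (9), together with the terminal-state factor, can be illustrated for a single updated cell. This is a minimal sketch; the function names and the argument layout are assumptions made for illustration.

```python
def td_target(reward, next_q_values, done, gamma):
    """One-step target r + (1 - done) * gamma * max_a' Q(s', a')."""
    return reward + (1.0 - float(done)) * gamma * max(next_q_values)


def smoothing_update(q_sa, target, alpha):
    """Form (8): exponential smoothing between the old value and the target."""
    return (1.0 - alpha) * q_sa + alpha * target


def td_error_update(q_sa, target, alpha):
    """Form (9): old value plus alpha times the temporal difference."""
    return q_sa + alpha * (target - q_sa)
```

For example, with reward 1.0, next-state values `[2.0, 4.0]`, `gamma = 0.5` and a non-terminal next state, the target is 3.0, and both update forms move an old value of 1.0 to 1.5 when `alpha = 0.25`. If the next state is terminal, the target reduces to the reward alone.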

The idea of exponential smoothing allows us to formulate the first practical algorithm that can work in the tabular case with unknown world dynamics:

Algorithm 1: Temporal Difference algorithm

Hyperparameters: $\alpha \in (0, 1)$

Initialize $Q^*(s, a)$ arbitrarily

On each interaction step:

1. select $a = \operatorname{argmax}_a Q^*(s, a)$

2. observe transition $(s, a, r', s', \text{done})$

3. update table:

$$Q^*(s, a) \leftarrow Q^*(s, a) + \alpha \left[ r' + (1 - \text{done}) \gamma \max_{a'} Q^*(s', a') - Q^*(s, a) \right]$$
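The three steps of Algorithm 1 can be sketched as follows. The chain environment `env_step`, its reward and the constants are hypothetical and exist only to make the example runnable. The `epsilon` branch is an added assumption as well: Algorithm 1 as stated selects actions greedily, and the random branch is included here only so that the toy agent visits every state-action pair (the exploration issue is discussed later in this section).

```python
import random

N_STATES, N_ACTIONS = 4, 2   # hypothetical chain 0-1-2-3, state 3 is terminal
GAMMA = 0.9


def env_step(state, action):
    """Toy deterministic dynamics: action 1 moves right, action 0 moves left."""
    if action == 1:
        next_state = min(state + 1, N_STATES - 1)
    else:
        next_state = max(state - 1, 0)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return reward, next_state, done


def td_learning(n_steps=50_000, alpha=0.5, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    state = 0
    for _ in range(n_steps):
        # Step 1: greedy action; the epsilon branch is an added assumption.
        if rng.random() < epsilon:
            action = rng.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: q[state][a])
        # Step 2: observe the transition (s, a, r', s', done).
        reward, next_state, done = env_step(state, action)
        # Step 3: move the visited cell along the temporal difference.
        target = reward + (1.0 - float(done)) * GAMMA * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        # Restart from state 0 once the terminal state is reached.
        state = 0 if done else next_state
    return q
```

In this deterministic chain the learned values for moving right approach 0.81, 0.9 and 1.0 in states 0, 1 and 2, which matches discounting the single terminal reward by `GAMMA` per step. Only the visited cell changes on each step, and no transition probabilities are used.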

Note on terminology: the Temporal Difference algorithm is also known as TD(0) due to theoretical generalizations.

It turns out that, under several assumptions on state visitation during the interaction process, this procedure has similar convergence guarantees, which are stated by the following theorem:

Proposition 7. [26] Let's define

$$e_t(s, a) = \begin{cases} \alpha_t & \text{if } (s, a) \text{ is updated on step } t \\ 0 & \text{otherwise} \end{cases}$$

Then if for every state-action pair $(s, a)$

$$\sum_{t}^{+\infty} e_t(s, a) = \infty, \qquad \sum_{t}^{+\infty} e_t(s, a)^2 < \infty$$

Algorithm 1 converges to the optimal $Q^*$ with probability 1.
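As a worked instance of the two conditions in Proposition 7, assume (hypothetically) that a given pair $(s, a)$ is updated on every step $t = 1, 2, \dots$ with the step size $\alpha_t = 1/(t+1)$. This schedule is an illustrative choice, not one prescribed by the text.

```latex
\sum_{t=1}^{+\infty} \frac{1}{t+1} = \infty
\quad \text{(harmonic series)},
\qquad
\sum_{t=1}^{+\infty} \frac{1}{(t+1)^2} = \frac{\pi^2}{6} - 1 < \infty .
% A constant step size \alpha_t = \alpha > 0 satisfies the first condition
% but violates the second one:
\sum_{t=1}^{+\infty} \alpha = \infty,
\qquad
\sum_{t=1}^{+\infty} \alpha^2 = \infty .
```

Under this assumption the decaying schedule meets both conditions, while a constant step size meets only the first. If a pair stops being visited after finitely many steps, its sum is finite and the first condition fails, which is why the theorem is a statement about exploration.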

This theorem states that the basic point iteration method can actually be applied online in the way proposed by the TD algorithm, but demands «enough exploration» from the strategy of interacting with the MDP during training. Satisfying this demand remains a distinctive and common problem of reinforcement learning.

The widespread kludge is the ε-greedy strategy, which suggests choosing a random action instead of $a = \operatorname{argmax}_a Q^*(s, a)$ with probability $\varepsilon$. The probability $\varepsilon$ is usually set close to 1 during the first interaction iterations and scheduled to decrease to a constant close to 0. This heuristic makes the agent visit all states with non-zero probability, independent of what the current approximation $Q^*(s, a)$ suggests.
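The ε-greedy rule and a decreasing schedule can be sketched as below. The linear shape of the schedule and the default values `eps_start`, `eps_end` and `decay_steps` are assumptions chosen for illustration; the text only says that ε starts close to 1 and decreases to a constant close to 0.

```python
import random


def epsilon_schedule(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon = 0.0` the rule reduces to the greedy choice, so `epsilon_greedy([0.1, 0.7, 0.3], 0.0)` returns action 1. Keeping `eps_end` above zero preserves a non-zero probability for every action in every visited state.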

The main practical issue with the Temporal Difference algorithm is that it requires the table $Q^*(s, a)$ to be explicitly stored in memory, which is impossible for MDPs with high state space complexity. This limitation substantially restricted its applicability until its combination with deep neural networks was proposed.

3.2. Deep Q-learning (DQN)

Using neural networks to model either a policy or a Q-function removes the need to construct task-specific features and opens up the possibility of applying RL algorithms to complex tasks, e.g. tasks with images as input. Video games are a classical example of such tasks, where the raw pixels of the screen are provided as the state representation and, correspondingly, as input to either the policy or the Q-function.

The main idea of Deep Q-learning [13] is to adapt the Temporal Difference algorithm so that update formula (9) would be equivalent to a gradient descent step for training a neural network to solve a certain regression task. Indeed, it can be noticed that the exponential smoothing parameter $\alpha_t$ resembles the learning rate of first-order gradient optimization procedures, while the exploration conditions from theorem 7 look identical to the restrictions on the learning rate of stochastic gradient descent.

The key hint is that (9) is actually a gradient descent step in the parameter space of the table functions family:

$$Q^*(s, a, \theta) = \theta^{s,a}$$

where all $\theta^{s,a}$ form a vector of parameters $\theta \in \mathbb{R}^{|S||A|}$.

To unravel this fact, it is convenient to introduce some notation from regression tasks. First, let's denote by $y$ the target of our regression task, i.e. the quantity that our model is trying to predict:

$$y(s, a) = r(s') + \gamma \max_{a'} Q^*(s', a', \theta) \tag{10}$$

where $s'$ is a sample from $p(s' \mid s, a)$ and $s, a$ is the input data. In this notation (9) is equivalent to:

$$\theta_{t+1} = \theta_t + \alpha_t \left[ y(s, a) - Q^*(s, a, \theta_t) \right] e^{s,a}$$

where we multiplied the scalar value $\alpha_t \left[ y(s, a) - Q^*(s, a, \theta_t) \right]$ by the following vector $e^{s,a}$:

$$e^{s,a}_{i,j} = \begin{cases} 1 & (i, j) = (s, a) \\ 0 & (i, j) \ne (s, a) \end{cases}$$

to formulate an update of only one component of $\theta$ in a vector form. By this we transitioned to an update in parameter space using $Q^*(s, a, \theta) = \theta^{s,a}$. Remark that for table functions family the
