a) The randomness in the reward $R$,
b) The randomness in the transition $P^\pi$, and
c) The next-state value distribution $Z(X', A')$.
In particular, we make the usual assumption that these three quantities are independent. In this section we will show that (5) is a contraction mapping whose unique fixed point is the random return $Z^\pi$.
3.3.1. CONTRACTION IN $\bar{d}_p$
Consider the process $Z_{k+1} := \mathcal{T}^\pi Z_k$, starting with some $Z_0 \in \mathcal{Z}$. We may expect the limiting expectation of $\{Z_k\}$ to converge exponentially quickly, as usual, to $Q^\pi$. As we now show, the process converges in a stronger sense: $\mathcal{T}^\pi$ is a contraction in $\bar{d}_p$, which implies that all moments also converge exponentially quickly.
Lemma 3. $\mathcal{T}^\pi : \mathcal{Z} \to \mathcal{Z}$ is a $\gamma$-contraction in $\bar{d}_p$.
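As a quick numerical illustration (a minimal sketch, with a randomly generated MDP whose sizes, kernel, policy, and two-point reward distributions are all arbitrary choices), we can apply $\mathcal{T}^\pi$ exactly to two finitely supported value distributions and check that the maximal 1-Wasserstein distance $\bar{d}_1$ shrinks by at least a factor of $\gamma$:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P(x'|x, a)
pi = rng.dirichlet(np.ones(nA), size=nS)        # pi(a|x)
R_vals = rng.normal(size=(nS, nA, 2))           # two reward outcomes per (x, a)
R_probs = np.full((nS, nA, 2), 0.5)

def random_Z(n_atoms=5):
    # A finitely supported value distribution: (x, a) -> (atoms, probabilities).
    return {(x, a): (rng.normal(size=n_atoms), np.full(n_atoms, 1 / n_atoms))
            for x in range(nS) for a in range(nA)}

def bellman_op(Z):
    # Exact T^pi Z(x, a): a finite mixture over the reward outcome,
    # the next state x', the next action a' ~ pi, and the atom of Z(x', a').
    out = {}
    for x in range(nS):
        for a in range(nA):
            vals, probs = [], []
            for i in range(2):                  # the two reward outcomes
                for x2 in range(nS):
                    for a2 in range(nA):
                        zv, zp = Z[(x2, a2)]
                        vals.append(R_vals[x, a, i] + gamma * zv)
                        probs.append(R_probs[x, a, i] * P[x, a, x2]
                                     * pi[x2, a2] * zp)
            out[(x, a)] = (np.concatenate(vals), np.concatenate(probs))
    return out

def d_bar(Z1, Z2):
    # Maximal 1-Wasserstein distance over all state-action pairs.
    return max(wasserstein_distance(Z1[k][0], Z2[k][0], Z1[k][1], Z2[k][1])
               for k in Z1)

Z1, Z2 = random_Z(), random_Z()
before, after = d_bar(Z1, Z2), d_bar(bellman_op(Z1), bellman_op(Z2))
print(f"ratio {after / before:.3f} <= gamma = {gamma}")  # contraction holds
```

Because the backup is computed exactly as a finite mixture, the contraction holds on every run rather than only in expectation.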
Using Lemma 3 and Banach's fixed point theorem, we conclude that $\mathcal{T}^\pi$ has a unique fixed point. By inspection, this fixed point must be $Z^\pi$ as defined in (1). As we assume all moments are bounded, this is sufficient to conclude that the sequence $\{Z_k\}$ converges to $Z^\pi$ in $\bar{d}_p$ for $1 \le p \le \infty$.
To conclude, we remark that not all distributional metrics are equal; for example, Chung & Sobel (1987) have shown that $\mathcal{T}^\pi$ is not a contraction in total variation distance. Similar results can be derived for the Kullback-Leibler divergence and the Kolmogorov distance.
3.3.2. CONTRACTION IN CENTERED MOMENTS
Observe that $d_2(U, V)$ (and more generally, $d_p$) relates to a coupling $C(\omega) := U(\omega) - V(\omega)$, in the sense that

$$d_2^2(U, V) \le \mathbb{E}\big[(U - V)^2\big] = \mathbb{V}(C) + (\mathbb{E}\, C)^2.$$
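This bound is easy to check numerically. The following minimal sketch, with arbitrarily chosen Gaussian samples, compares an arbitrary coupling of $U$ and $V$ against the comonotone (sorted) coupling, which attains $d_2$ for distributions on the real line:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
u = rng.normal(0.0, 1.0, n)          # samples of U
v = rng.normal(0.5, 2.0, n)          # samples of V, paired arbitrarily with U

c = u - v                            # an arbitrary coupling C = U - V
arbitrary = np.mean(c ** 2)          # E[(U - V)^2] = V(C) + (E C)^2
assert np.isclose(arbitrary, c.var() + c.mean() ** 2)

# The optimal coupling for d_2 on the real line pairs equal quantiles.
optimal = np.mean((np.sort(u) - np.sort(v)) ** 2)   # approximates d_2^2(U, V)

print(f"d_2^2(U, V)  ~= {optimal:.4f}")
print(f"E[(U - V)^2]  = {arbitrary:.4f}  (>= d_2^2, as claimed)")
```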
As a result, we cannot directly use $d_2$ to bound the variance difference $|\mathbb{V}(\mathcal{T}^\pi Z(x, a)) - \mathbb{V}(Z^\pi(x, a))|$. However, $\mathcal{T}^\pi$ is in fact a contraction in variance (Sobel, 1982; see also the appendix). In general, $\mathcal{T}^\pi$ is not a contraction in the $p$th centered moment, $p > 2$, but the centered moments of the iterates $\{Z_k\}$ still converge exponentially quickly to those of $Z^\pi$; the proof extends the result of Rösler (1992).
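To make the rates concrete, consider a hypothetical single-state MDP with a self-loop, where $Z_{k+1} \stackrel{D}{=} R + \gamma Z_k$ and $R$ is independent of $Z_k$. The exact moment recursions below (a sketch under these assumptions, not a general implementation) show the mean gap contracting at rate $\gamma$ and the variance gap at rate $\gamma^2$:

```python
# Single-state MDP with a self-loop: Z_{k+1} =_D R + gamma * Z_k, R ⟂ Z_k.
gamma = 0.9
mean_r, var_r = 1.0, 4.0                 # moments of the reward R (assumed)

mean_star = mean_r / (1 - gamma)         # fixed-point mean, i.e. Q^pi
var_star = var_r / (1 - gamma ** 2)      # fixed-point variance of Z^pi

mean_k, var_k = 0.0, 0.0                 # Z_0: a point mass at zero
for k in range(1, 6):
    mean_k = mean_r + gamma * mean_k     # E Z_{k+1} = E R + gamma * E Z_k
    var_k = var_r + gamma ** 2 * var_k   # V Z_{k+1} = V R + gamma^2 * V Z_k
    print(f"k={k}: |mean gap|={abs(mean_k - mean_star):.4f}, "
          f"|var gap|={abs(var_k - var_star):.4f}")
```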
3.4. Control
Thus far we have considered a fixed policy $\pi$, and studied the behaviour of its associated operator $\mathcal{T}^\pi$. We now set out to understand the distributional operators of the control setting – where we seek a policy $\pi$ that maximizes value – and the corresponding notion of an optimal value distribution. As with the optimal value function, this notion is intimately tied to that of an optimal policy. However, while all optimal policies attain the same value $Q^*$, in our case a difficulty arises: in general there are many optimal value distributions.
In this section we show that the distributional analogue of the Bellman optimality operator converges, in a weak sense, to the set of optimal value distributions. However, this operator is not a contraction in any metric between distributions, and is in general much more temperamental than the policy evaluation operators. We believe the convergence issues we outline here are a symptom of the inherent instability of greedy updates, as highlighted by e.g. Tsitsiklis (2002) and most recently Harutyunyan et al. (2016).
Let $\Pi^*$ be the set of optimal policies. We begin by characterizing what we mean by an optimal value distribution.

Definition 1 (Optimal value distribution). An optimal value distribution is the v.d. of an optimal policy. The set of optimal value distributions is $\mathcal{Z}^* := \{ Z^{\pi^*} : \pi^* \in \Pi^* \}$.
We emphasize that not all value distributions with expectation $Q^*$ are optimal: they must match the full distribution of the return under some optimal policy.
Definition 2. A greedy policy $\pi$ for $Z \in \mathcal{Z}$ maximizes the expectation of $Z$. The set of greedy policies for $Z$ is

$$\mathcal{G}_Z := \Big\{ \pi : \sum_a \pi(a \,|\, x)\, \mathbb{E}\, Z(x, a) = \max_{a' \in \mathcal{A}} \mathbb{E}\, Z(x, a') \Big\}.$$
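A minimal sketch of this definition, using made-up atoms for a single state with three actions, recovers the greedy set as the argmax of the per-action expectations; note how ties (actions 0 and 1 below) arise even though the underlying distributions differ:

```python
import numpy as np

# Hypothetical atoms and probabilities for |A| = 3 actions at one state x.
atoms = np.array([[0.0, 10.0],      # action 0: risky, E Z(x, 0) = 5.0
                  [5.0, 5.0],       # action 1: safe,  E Z(x, 1) = 5.0
                  [1.0, 2.0]])      # action 2: worse, E Z(x, 2) = 1.5
probs = np.full_like(atoms, 0.5)

q = (atoms * probs).sum(axis=1)     # E Z(x, a) for each action
greedy_actions = np.flatnonzero(np.isclose(q, q.max()))
print(q, "-> greedy actions:", greedy_actions)   # actions 0 and 1 tie
```

Such ties are precisely how distinct value distributions can share the same expectation.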
Recall that the expected Bellman optimality operator $\mathcal{T}$ is

$$\mathcal{T} Q(x, a) = \mathbb{E}\, R(x, a) + \gamma\, \mathbb{E}_P \max_{a' \in \mathcal{A}} Q(x', a'). \qquad (6)$$
The maximization at $x'$ corresponds to some greedy policy. Although this policy is implicit in (6), we cannot ignore it in the distributional setting. We will call a distributional Bellman optimality operator any operator $\mathcal{T}$ which implements a greedy selection rule, i.e.

$$\mathcal{T} Z = \mathcal{T}^\pi Z \quad \text{for some } \pi \in \mathcal{G}_Z.$$
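The following sketch implements one such operator on a hypothetical MDP (deterministic rewards, for brevity): it selects a greedy policy for the current $Z$, breaking ties by lowest action index, and then performs one exact $\mathcal{T}^\pi$ backup under that policy:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P(x'|x, a)
R = rng.normal(size=(nS, nA))                   # deterministic rewards

Z = rng.normal(size=(nS, nA, 5))                # equally weighted atoms of Z(x, a)

def optimality_backup(Z):
    n_atoms = Z.shape[2]
    q = Z.mean(axis=2)                          # E Z(x, a)
    greedy = q.argmax(axis=1)                   # one greedy policy in G_Z
    atoms = np.empty((nS, nA, nS * n_atoms))
    weights = np.empty((nS, nA, nS * n_atoms))
    for x in range(nS):
        for a in range(nA):
            # T Z(x, a): mixture over x' of R(x, a) + gamma * Z(x', greedy(x')).
            atoms[x, a] = np.concatenate(
                [R[x, a] + gamma * Z[x2, greedy[x2]] for x2 in range(nS)])
            weights[x, a] = np.repeat(P[x, a] / n_atoms, n_atoms)
    return atoms, weights

atoms, weights = optimality_backup(Z)
print(atoms.shape, weights.sum(axis=2))         # weights sum to 1 per (x, a)
```

The tie-breaking rule is exactly the freedom that makes $\mathcal{T}$ non-unique; note also that the backup produces weighted atoms, so iterating it in practice requires a projection or resampling step.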
As in the policy evaluation setting, we are interested in the behaviour of the iterates $Z_{k+1} := \mathcal{T} Z_k$, $Z_0 \in \mathcal{Z}$. Our first result is to assert that $\mathbb{E}\, Z_k$ behaves as expected.
Lemma 4. Let $Z_1, Z_2 \in \mathcal{Z}$. Then

$$\| \mathbb{E}\, \mathcal{T} Z_1 - \mathbb{E}\, \mathcal{T} Z_2 \|_\infty \le \gamma\, \| \mathbb{E}\, Z_1 - \mathbb{E}\, Z_2 \|_\infty ,$$

and in particular $\mathbb{E}\, Z_k \to Q^*$ exponentially quickly.
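Since $\mathbb{E}\, \mathcal{T} Z$ depends on $Z$ only through $\mathbb{E}\, Z$, Lemma 4 reduces to the classical $\gamma$-contraction of the operator in (6). The sketch below (with an arbitrary random MDP) checks this on two random Q-tables:

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P(x'|x, a)
R = rng.normal(size=(nS, nA))                   # E R(x, a)

def expected_backup(Q):
    # T Q(x, a) = E R(x, a) + gamma * E_P max_{a'} Q(x', a'), as in (6).
    return R + gamma * P @ Q.max(axis=1)

Q1, Q2 = rng.normal(size=(nS, nA)), rng.normal(size=(nS, nA))
before = np.abs(Q1 - Q2).max()
after = np.abs(expected_backup(Q1) - expected_backup(Q2)).max()
print(f"||T Q1 - T Q2||_inf = {after:.4f} "
      f"<= gamma * ||Q1 - Q2||_inf = {gamma * before:.4f}")
```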
By inspecting Lemma 4, we might expect that $Z_k$ converges quickly in $\bar{d}_p$ to some fixed point in $\mathcal{Z}^*$. Unfortunately, convergence is neither quick nor assured to reach a fixed point. In fact, the best we can hope for is pointwise convergence, not even to the set $\mathcal{Z}^*$ but to the larger set of nonstationary optimal value distributions.