Deep Reinforcement Learning: Prioritized Experience Replay


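This PDF is the original research paper on Prioritized Experience Replay, written by authors at Google DeepMind and published at ICLR 2016. It explains how prioritized experience replay improves learning efficiency in deep reinforcement learning (DRL), in particular when applied to the DQN (Deep Q-Network) algorithm.

In reinforcement learning, experience replay is a key technique that lets an online learning agent remember and reuse past experiences. In prior work, transitions were sampled uniformly at random from the replay buffer. That approach ignores how much each transition matters, so significant and insignificant transitions are replayed at the same frequency. The paper proposes a framework, prioritized experience replay, that replays important transitions more frequently and therefore learns more efficiently. The authors apply it to DQN, a reinforcement learning algorithm that achieved human-level performance on many Atari games. DQN with prioritized experience replay outperforms DQN with uniform replay on 41 of 49 Atari games, setting a new state of the art.

1. Introduction. During online learning, a deep reinforcement learning agent incrementally updates the parameters of its policy, value function, or model. Uniformly random replay can use the stored samples inefficiently, so some key learning moments may be overlooked. Prioritized experience replay addresses this by assigning a priority to each transition, so that more important samples are more likely to be selected for replay.

2. Method. The core of prioritized experience replay is to assign each transition a priority according to some measure, such as its TD error. High-priority transitions are replayed more often, and low-priority transitions are selected less often. This helps the agent learn key behaviors quickly and is described as reducing sample-induced fluctuation during training.

3. Experiments. The experiments show a clear advantage of prioritized experience replay over conventional (uniform) experience replay on the Atari games. By tuning the prioritized sampling strategy (for example, proportional prioritization), one can balance stability against learning speed.

4. Conclusion. Prioritized experience replay is an effective enhancement to reinforcement learning and matters for improving the performance of deep reinforcement learning algorithms such as DQN. It also opens a line of research on how to better manage and use experience data to optimize learning.

5. Extensions. Prioritized experience replay is not limited to DQN. It can also be applied to other reinforcement learning algorithms, such as Double Q-learning and algorithms for continuous action spaces, to improve their generalization and learning speed. By ranking experiences by priority, learning can focus on the most informative transitions and converge faster, which is especially useful in deep reinforcement learning tasks that process large amounts of complex data.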

Published as a conference paper at ICLR 2016

Algorithm 1 Double DQN with proportional prioritization

 1: Input: minibatch $k$, step-size $\eta$, replay period $K$ and size $N$, exponents $\alpha$ and $\beta$, budget $T$.
 2: Initialize replay memory $\mathcal{H} = \emptyset$, $\Delta = 0$, $p_1 = 1$
 3: Observe $S_0$ and choose $A_0 \sim \pi_\theta(S_0)$
 4: for $t = 1$ to $T$ do
 5:     Observe $S_t, R_t, \gamma_t$
 6:     Store transition $(S_{t-1}, A_{t-1}, R_t, \gamma_t, S_t)$ in $\mathcal{H}$ with maximal priority $p_t = \max_{i<t} p_i$
 7:     if $t \equiv 0 \bmod K$ then
 8:         for $j = 1$ to $k$ do
 9:             Sample transition $j \sim P(j) = p_j^{\alpha} / \sum_i p_i^{\alpha}$
10:             Compute importance-sampling weight $w_j = (N \cdot P(j))^{-\beta} / \max_i w_i$
11:             Compute TD-error $\delta_j = R_j + \gamma_j Q_{\text{target}}(S_j, \arg\max_a Q(S_j, a)) - Q(S_{j-1}, A_{j-1})$
12:             Update transition priority $p_j \leftarrow |\delta_j|$
13:             Accumulate weight-change $\Delta \leftarrow \Delta + w_j \cdot \delta_j \cdot \nabla_\theta Q(S_{j-1}, A_{j-1})$
14:         end for
15:         Update weights $\theta \leftarrow \theta + \eta \cdot \Delta$, reset $\Delta = 0$
16:         From time to time copy weights into target network $\theta_{\text{target}} \leftarrow \theta$
17:     end if
18:     Choose action $A_t \sim \pi_\theta(S_t)$
19: end for
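
The replay-memory side of Algorithm 1 (steps 2, 6, 9, 10 and 12) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `ProportionalReplay`, its method names, the small constant `eps`, the default `alpha`, and the choice to normalize weights by the maximum within the sampled minibatch are all assumptions made here. Sampling is a plain O(N) draw over an array rather than an optimized data structure.

```python
import numpy as np


class ProportionalReplay:
    """Array-backed proportional prioritized replay (hypothetical sketch)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6, seed=0):
        self.capacity = capacity
        self.alpha = alpha          # prioritization exponent
        self.eps = eps              # assumption: keeps every priority strictly positive
        self.data = []
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.pos = 0                # next slot to write (ring buffer)
        self.rng = np.random.default_rng(seed)

    def __len__(self):
        return len(self.data)

    def add(self, transition):
        # Step 6: store with the maximal priority seen so far (1 for the first item).
        max_p = self.priorities[:len(self.data)].max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta):
        n = len(self.data)
        # Step 9: P(j) = p_j^alpha / sum_i p_i^alpha
        scaled = self.priorities[:n] ** self.alpha
        probs = scaled / scaled.sum()
        idx = self.rng.choice(n, size=batch_size, p=probs)
        # Step 10: w_j = (N * P(j))^(-beta), normalized here by the minibatch maximum.
        weights = (n * probs[idx]) ** (-beta)
        weights /= weights.max()
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        # Step 12: p_j <- |delta_j| (plus eps, an assumption of this sketch).
        self.priorities[idx] = np.abs(td_errors) + self.eps
```

A training loop would call `add` once per environment step, then every K steps call `sample`, compute TD errors for the returned transitions, scale each error by its weight, and pass the raw errors back through `update_priorities`. A larger memory would likely need a tree-based structure so that sampling does not scan all priorities.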

3.4 ANNEALING THE BIAS

The estimation of the expected value with stochastic updates relies on those updates corresponding to the same distribution as its expectation. Prioritized replay introduces bias because it changes this distribution in an uncontrolled fashion, and therefore changes the solution that the estimates will converge to (even if the policy and state distribution are fixed). We can correct this bias by using importance-sampling (IS) weights

$$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta}$$

that fully compensate for the non-uniform probabilities $P(i)$ if $\beta = 1$. These weights can be folded into the Q-learning update by using $w_i \delta_i$ instead of $\delta_i$ (this is thus weighted IS, not ordinary IS, see e.g. Mahmood et al., 2014). For stability reasons, we always normalize weights by $1/\max_i w_i$ so that they only scale the update downwards.
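
The claim that $\beta = 1$ fully compensates for non-uniform sampling can be checked exactly on a toy example. The sketch below computes expectations in closed form (no random sampling) over four hypothetical transitions. The function name `is_weights` and the numbers in `updates` and `priorities` are made up for illustration.

```python
import numpy as np


def is_weights(probs, beta):
    """w_i = (1/N * 1/P(i))^beta for a memory of N = len(probs) transitions."""
    n = len(probs)
    return (1.0 / (n * probs)) ** beta


# Hypothetical per-transition update values and priorities.
updates = np.array([1.0, 2.0, 3.0, 10.0])
priorities = np.array([0.1, 0.2, 0.3, 2.0])
probs = priorities / priorities.sum()

# Expected update under uniform replay: the quantity we want to match.
uniform_mean = updates.mean()

# Expected update under prioritized sampling without any correction.
biased = np.sum(probs * updates)

# Expected update under prioritized sampling with full correction (beta = 1).
corrected = np.sum(probs * is_weights(probs, 1.0) * updates)

# Normalizing by the largest weight makes every weight at most 1,
# so the correction can only scale updates downwards.
normalized = is_weights(probs, 1.0) / is_weights(probs, 1.0).max()
```

Here `corrected` equals `uniform_mean`, while `biased` is pulled towards the high-priority transition. With `beta = 0` all weights are 1 and no correction is applied.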

In typical reinforcement learning scenarios, the unbiased nature of the updates is most important near convergence at the end of training, as the process is highly non-stationary anyway, due to changing policies, state distributions and bootstrap targets; we hypothesize that a small bias can be ignored in this context (see also Figure 12 in the appendix for a case study of full IS correction on Atari). We therefore exploit the flexibility of annealing the amount of importance-sampling correction over time, by defining a schedule on the exponent $\beta$ that reaches 1 only at the end of learning. In practice, we linearly anneal $\beta$ from its initial value $\beta_0$ to 1. Note that the choice of this hyperparameter interacts with the choice of prioritization exponent $\alpha$; increasing both simultaneously prioritizes sampling more aggressively at the same time as correcting for it more strongly.
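
A linear schedule of this kind takes only a few lines. The function below is a sketch: the name `linear_beta`, the argument `total_steps`, and the default `beta0=0.4` are illustrative choices, since the passage above does not fix a value for $\beta_0$.

```python
def linear_beta(step, total_steps, beta0=0.4):
    """Linearly anneal beta from beta0 at step 0 to 1.0 at total_steps.

    Steps outside [0, total_steps] are clamped, so beta never leaves [beta0, 1].
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return beta0 + frac * (1.0 - beta0)
```

The returned value would be passed as the exponent when computing importance-sampling weights for each minibatch, so the correction is partial early in training and complete only at the end.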

Importance sampling has another benefit when combined with prioritized replay in the context of non-linear function approximation (e.g. deep neural networks): here large steps can be very disruptive, because the first-order approximation of the gradient is only reliable locally, and have to be prevented with a smaller global step-size. In our approach instead, prioritization makes sure high-error transitions are seen many times, while the IS correction reduces the gradient magnitudes (and thus the effective step size in parameter space), allowing the algorithm to follow the curvature of highly non-linear optimization landscapes because the Taylor expansion is constantly re-approximated.
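
A worked special case may make the step-size argument concrete. Assume, purely for illustration, proportional priorities $p_i = |\delta_i|$ with all $\delta_i \neq 0$, prioritization exponent $\alpha = 1$, and full correction $\beta = 1$; none of these settings is prescribed by the passage above. Under these assumptions the weighted error magnitude is the same for every transition:

```latex
P(i) = \frac{|\delta_i|}{\sum_k |\delta_k|},
\qquad
w_i = \frac{1}{N \, P(i)} = \frac{\sum_k |\delta_k|}{N \, |\delta_i|}
\quad\Longrightarrow\quad
w_i \, |\delta_i| = \frac{1}{N} \sum_k |\delta_k| .

% After normalizing by 1 / \max_i w_i, the largest weight belongs to the
% transition with the smallest error, so
\frac{w_i}{\max_k w_k} = \frac{\min_k |\delta_k|}{|\delta_i|}
\quad\Longrightarrow\quad
\frac{w_i}{\max_k w_k} \, |\delta_i| = \min_k |\delta_k| .
```

In this special case a high-error transition is replayed more often, but each replay contributes an error term no larger than that of the lowest-error transition, which is one way to read "many small steps instead of one large step".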

We combine our prioritized replay algorithm into a full-scale reinforcement learning agent, based on the state-of-the-art Double DQN algorithm. Our principal modification is to replace the uniform random sampling used by Double DQN with our stochastic prioritization and importance sampling methods (see Algorithm 1).
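
The Double DQN TD error in step 11 of Algorithm 1 can be sketched with NumPy arrays standing in for network outputs. The function name `double_dqn_td_errors` and its argument names are hypothetical, and the inputs are assumed to be precomputed Q-value tables of shape (batch, actions) rather than live networks.

```python
import numpy as np


def double_dqn_td_errors(q_online_prev, q_online_next, q_target_next,
                         actions, rewards, discounts):
    """TD errors delta_j for a minibatch, following step 11 of Algorithm 1.

    q_online_prev: (B, A) online Q-values at S_{j-1}
    q_online_next: (B, A) online Q-values at S_j
    q_target_next: (B, A) target-network Q-values at S_j
    actions:       (B,)   integer actions A_{j-1}
    rewards:       (B,)   rewards R_j
    discounts:     (B,)   discounts gamma_j (0 where the episode terminated)
    """
    batch = np.arange(len(actions))
    # The online network selects the greedy next action ...
    best_next = np.argmax(q_online_next, axis=1)
    # ... and the target network evaluates it.
    bootstrap = q_target_next[batch, best_next]
    return rewards + discounts * bootstrap - q_online_prev[batch, actions]
```

With prioritized replay, the absolute values of these errors become the new priorities of the sampled transitions, and each error is multiplied by its importance-sampling weight before it enters the gradient step.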
