Here, θ⁻ denotes the target network parameters which are copied from the online network parameters θ every 2500 learner steps.
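For concreteness, a minimal sketch of this periodic target-network synchronization is given below; the function name and the representation of parameters as a dictionary of arrays are illustrative assumptions, not the authors' implementation.

    TARGET_UPDATE_PERIOD = 2500  # learner steps between copies, as stated above

    def maybe_sync_target(learner_step, online_params, target_params):
        """Copy the online parameters into the target network every 2500 learner steps."""
        # Parameters are assumed to be stored as a dict of arrays purely for illustration.
        if learner_step % TARGET_UPDATE_PERIOD == 0:
            target_params = {name: value.copy() for name, value in online_params.items()}
        return target_params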
Our replay prioritization differs from that of Ape-X in that we use a mixture of max and mean absolute n-step TD-errors δᵢ over the sequence: p = η maxᵢ δᵢ + (1 − η) δ̄. We set η and the
priority exponent to 0.9. This more aggressive scheme is motivated by our observation that averaging
over long sequences tends to wash out large errors, thereby compressing the range of priorities and
limiting the ability of prioritization to pick out useful experience. We also found no benefit from
using the importance weighting that has been typically applied with prioritized replay (Schaul et al.,
2016), and therefore omitted this step in R2D2.
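As a concrete illustration of this prioritization rule, the sketch below computes a sequence priority from per-step TD-errors; the function name and the use of NumPy are assumptions, not the authors' code.

    import numpy as np

    ETA = 0.9                # mixture coefficient η, as given in the text
    PRIORITY_EXPONENT = 0.9  # priority exponent, as given in the text

    def sequence_priority(td_errors):
        """Mixture of max and mean absolute n-step TD-errors over one replayed sequence."""
        abs_errors = np.abs(np.asarray(td_errors))
        p = ETA * abs_errors.max() + (1.0 - ETA) * abs_errors.mean()
        # Sequences are sampled with probability proportional to p raised to the priority
        # exponent; per the text, no importance-sampling correction is applied in R2D2.
        return p ** PRIORITY_EXPONENT

With both η and the priority exponent set to 0.9, the max term dominates, which reflects the observation that plain averaging over long sequences compresses the range of priorities.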
Finally, compared to Ape-X, we used the slightly higher discount of γ = 0.997, and disabled the loss-of-life-as-episode-end heuristic that has been used in some Atari agents since (Mnih et al., 2015). A full list of hyper-parameters is provided in the appendix.
We train the R2D2 agent with a single GPU-based learner, performing approximately 5 network up-
dates per second (each update on a mini-batch of 64 length-80 sequences), and each actor performing
∼ 260 environment steps per second on Atari (∼ 130 per second on DMLab).
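For reference, the hyper-parameter choices quoted in this section can be collected into a single configuration sketch; the key names are illustrative assumptions, and the complete list is given in the appendix.

    # Illustrative summary of only the hyper-parameters mentioned in this section;
    # the key names are assumptions, and the full list appears in the appendix.
    R2D2_CONFIG = {
        "target_update_period": 2500,     # learner steps between target-network copies
        "priority_eta": 0.9,              # mixture coefficient for max/mean absolute TD-errors
        "priority_exponent": 0.9,
        "importance_sampling": False,     # no importance weighting, unlike Schaul et al. (2016)
        "discount": 0.997,
        "life_loss_as_episode_end": False,
        "batch_size": 64,                 # sequences per mini-batch
        "sequence_length": 80,            # length of each replayed sequence
    }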
3 TRAINING RECURRENT RL AGENTS WITH EXPERIENCE REPLAY
In order to achieve good performance in a partially observed environment, an RL agent requires
a state representation that encodes information about its state-action trajectory in addition to its
current observation. The most common way to achieve this is by using an RNN, typically an LSTM
(Hochreiter & Schmidhuber, 1997), as part of the agent’s state encoding. To train an RNN from
replay and enable it to learn meaningful long-term dependencies, whole state-action trajectories
need to be stored in replay and used for training the network. Hausknecht & Stone (2015) compared
two strategies of training an LSTM from replayed experience (sketched in code below):
• Using a zero start state to initialize the network at the beginning of sampled sequences.
• Replaying whole episode trajectories.
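To make the two strategies concrete, the following is a hedged sketch using a generic LSTM (PyTorch is used here purely for illustration; the sizes and function names are assumptions rather than the agent's actual architecture).

    import torch

    lstm = torch.nn.LSTM(input_size=64, hidden_size=128)  # illustrative sizes only

    def unroll_zero_start(subsequence):
        """Strategy 1: a sampled sub-sequence whose recurrent state is initialized to zeros."""
        # subsequence has shape [seq_len, batch, input_size]
        batch = subsequence.shape[1]
        zero_state = (torch.zeros(1, batch, 128), torch.zeros(1, batch, 128))
        outputs, _ = lstm(subsequence, zero_state)
        return outputs

    def unroll_whole_episode(episode):
        """Strategy 2: replay a whole episode, so the zero state is the true initial state."""
        outputs, _ = lstm(episode)  # omitting the state defaults to zeros at the episode start
        return outputs

The only difference is whether the zero state coincides with the true recurrent state at the start of the unrolled segment: it does when the segment is a whole episode, but generally not when it is a sub-sequence sampled from the middle of one.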
The zero start state strategy’s appeal lies in its simplicity, and it allows independent decorrelated
sampling of relatively short sequences, which is important for robust optimization of a neural net-
work. On the other hand, it forces the RNN to learn to recover meaningful predictions from an
atypical initial recurrent state (‘initial recurrent state mismatch’), which may limit its ability to fully
rely on its recurrent state and learn to exploit long temporal correlations. The second strategy, in contrast, avoids the problem of finding a suitable initial state, but creates a number of practical, computational, and algorithmic issues: sequence lengths vary and are potentially environment-dependent, and network updates have higher variance because states within a trajectory are highly correlated, compared to training on randomly sampled batches of experience tuples.
The authors observed little difference between their two strategies for the empirical agent perfor-
mance on a set of Atari games, and therefore opted for the simpler zero state strategy. One possible
explanation for this is that in some cases (as we will see below), an LSTM tends to converge to a
more ‘typical’ state if allowed a certain number of ‘burn-in’ steps, and so recovers from a bad initial
recurrent state on a sufficiently long sequence. We also hypothesize that while the zero state strat-
egy may suffice in the largely fully observable Atari domain, it prevents a recurrent network from
learning actual long-term dependencies in more memory-critical domains (e.g. on DMLab).
To fix these issues, we propose and evaluate two strategies for training a recurrent neural network from randomly sampled replay sequences, which can be used individually or in combination:
• Storing the recurrent state in replay and using it to initialize the network at training time. This partially remedies the weakness of the zero start state strategy; however, it may suffer from the effect of ‘representational drift’ leading to ‘recurrent state staleness’, as the stored recurrent state generated by a sufficiently old network could differ significantly from a typical state produced by a more recent version.
• Allowing the network a ‘burn-in period’ by using a portion of the replay sequence only for unrolling the network and producing a start state, and updating the network only on the