Here, θ⁻ denotes the target network parameters which are copied from the online network parameters θ every 2500 learner steps.
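For concreteness, a minimal sketch of this periodic target-network synchronization is given below; the function name and the representation of parameters as a dictionary of arrays are illustrative assumptions, not the authors' implementation.

    TARGET_UPDATE_PERIOD = 2500  # learner steps between copies, as stated above

    def maybe_sync_target(learner_step, online_params, target_params):
        """Copy the online parameters into the target network every 2500 learner steps."""
        # Parameters are assumed to be stored as a dict of arrays purely for illustration.
        if learner_step % TARGET_UPDATE_PERIOD == 0:
            target_params = {name: value.copy() for name, value in online_params.items()}
        return target_params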
Our replay prioritization differs from that of Ape-X in that we use a mixture of max and mean absolute n-step TD-errors δᵢ over the sequence: p = η maxᵢ δᵢ + (1 − η) δ̄. We set η and the
priority exponent to 0.9. This more aggressive scheme is motivated by our observation that averaging
over long sequences tends to wash out large errors, thereby compressing the range of priorities and
limiting the ability of prioritization to pick out useful experience. We also found no benefit from
using the importance weighting that has been typically applied with prioritized replay (Schaul et al.,
2016), and therefore omitted this step in R2D2.
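As a concrete illustration of this prioritization rule, the sketch below computes a sequence priority from per-step TD-errors; the function name and the use of NumPy are assumptions, not the authors' code.

    import numpy as np

    ETA = 0.9                # mixture coefficient η, as given in the text
    PRIORITY_EXPONENT = 0.9  # priority exponent, as given in the text

    def sequence_priority(td_errors):
        """Mixture of max and mean absolute n-step TD-errors over one replayed sequence."""
        abs_errors = np.abs(np.asarray(td_errors))
        p = ETA * abs_errors.max() + (1.0 - ETA) * abs_errors.mean()
        # Sequences are sampled with probability proportional to p raised to the priority
        # exponent; per the text, no importance-sampling correction is applied in R2D2.
        return p ** PRIORITY_EXPONENT

With both η and the priority exponent set to 0.9, the max term dominates, which reflects the observation that plain averaging over long sequences compresses the range of priorities.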
Finally, compared to Ape-X, we used the slightly higher discount of γ = 0.997, and disabled the loss-of-life-as-episode-end heuristic that has been used in some Atari agents since (Mnih et al., 2015). A full list of hyper-parameters is provided in the appendix.
We train the R2D2 agent with a single GPU-based learner, performing approximately 5 network up-
dates per second (each update on a mini-batch of 64 length-80 sequences), and each actor performing
∼ 260 environment steps per second on Atari (∼ 130 per second on DMLab).
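For reference, the hyper-parameter choices quoted in this section can be collected into a single configuration sketch; the key names are illustrative assumptions, and the complete list is given in the appendix.

    # Illustrative summary of only the hyper-parameters mentioned in this section;
    # the key names are assumptions, and the full list appears in the appendix.
    R2D2_CONFIG = {
        "target_update_period": 2500,     # learner steps between target-network copies
        "priority_eta": 0.9,              # mixture coefficient for max/mean absolute TD-errors
        "priority_exponent": 0.9,
        "importance_sampling": False,     # no importance weighting, unlike Schaul et al. (2016)
        "discount": 0.997,
        "life_loss_as_episode_end": False,
        "batch_size": 64,                 # sequences per mini-batch
        "sequence_length": 80,            # length of each replayed sequence
    }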
3 TRAINING RECURRENT RL AGENTS WITH EXPERIENCE REPLAY
In order to achieve good performance in a partially observed environment, an RL agent requires
a state representation that encodes information about its state-action trajectory in addition to its
current observation. The most common way to achieve this is by using an RNN, typically an LSTM
(Hochreiter & Schmidhuber, 1997), as part of the agent’s state encoding. To train an RNN from
replay and enable it to learn meaningful long-term dependencies, whole state-action trajectories
need to be stored in replay and used for training the network. Hausknecht & Stone (2015) compared
two strategies of training an LSTM from replayed experience (sketched in code below):
• Using a zero start state to initialize the network at the beginning of sampled sequences.
• Replaying whole episode trajectories.
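To make the two strategies concrete, the following is a hedged sketch using a generic LSTM (PyTorch is used here purely for illustration; the sizes and function names are assumptions rather than the agent's actual architecture).

    import torch

    lstm = torch.nn.LSTM(input_size=64, hidden_size=128)  # illustrative sizes only

    def unroll_zero_start(subsequence):
        """Strategy 1: a sampled sub-sequence whose recurrent state is initialized to zeros."""
        # subsequence has shape [seq_len, batch, input_size]
        batch = subsequence.shape[1]
        zero_state = (torch.zeros(1, batch, 128), torch.zeros(1, batch, 128))
        outputs, _ = lstm(subsequence, zero_state)
        return outputs

    def unroll_whole_episode(episode):
        """Strategy 2: replay a whole episode, so the zero state is the true initial state."""
        outputs, _ = lstm(episode)  # omitting the state defaults to zeros at the episode start
        return outputs

The only difference is whether the zero state coincides with the true recurrent state at the start of the unrolled segment: it does when the segment is a whole episode, but generally not when it is a sub-sequence sampled from the middle of one.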
The zero start state strategy’s appeal lies in its simplicity, and it allows independent decorrelated
sampling of relatively short sequences, which is important for robust optimization of a neural net-
work. On the other hand, it forces the RNN to learn to recover meaningful predictions from an
atypical initial recurrent state (‘initial recurrent state mismatch’), which may limit its ability to fully
rely on its recurrent state and learn to exploit long temporal correlations. The second strategy, in contrast, avoids the problem of finding a suitable initial state, but creates a number of practical, computational, and algorithmic issues: sequence lengths vary and are potentially environment-dependent, and network updates have higher variance because states within a trajectory are highly correlated, compared to training on randomly sampled batches of experience tuples.
The authors observed little difference between their two strategies for the empirical agent perfor-
mance on a set of Atari games, and therefore opted for the simpler zero state strategy. One possible
explanation for this is that in some cases (as we will see below), an LSTM tends to converge to a
more ‘typical’ state if allowed a certain number of ‘burn-in’ steps, and so recovers from a bad initial
recurrent state on a sufficiently long sequence. We also hypothesize that while the zero state strat-
egy may suffice in the largely fully observable Atari domain, it prevents a recurrent network from
learning actual long-term dependencies in more memory-critical domains (e.g. on DMLab).
To fix these issues, we propose and evaluate two strategies for training a recurrent neural network from randomly sampled replay sequences, which can be used individually or in combination:
• Storing the recurrent state in replay and using it to initialize the network at training time. This partially remedies the weakness of the zero start state strategy; however, it may suffer from the effect of ‘representational drift’ leading to ‘recurrent state staleness’, as the stored recurrent state generated by a sufficiently old network could differ significantly from a typical state produced by a more recent version.
• Allowing the network a ‘burn-in period’ by using a portion of the replay sequence only for unrolling the network and producing a start state, and updating the network only on the