ing minibatches. This reduces the chances of multiple actor-learners overwriting each other's updates. Accumulating updates over several steps also provides some ability to trade off computational efficiency for data efficiency.
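As an illustration only, a minimal NumPy sketch of accumulating per-step gradients into a single update; compute_gradient and apply_update are hypothetical placeholders for an actor-learner's gradient computation and its asynchronous parameter update, not functions from the actual implementation.

import numpy as np

def accumulate_and_apply(params, transitions, compute_gradient, apply_update):
    # Sum the per-step gradients instead of applying each one immediately,
    # analogous to using a minibatch.
    accumulated = np.zeros_like(params)
    for transition in transitions:
        accumulated += compute_gradient(params, transition)
    # A single update for the whole batch of steps reduces the chance that
    # parallel actor-learners overwrite each other's changes.
    return apply_update(params, accumulated)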
Finally, we found that giving each thread a different exploration policy helps improve robustness. Adding diversity to exploration in this manner also generally improves performance through better exploration. While there are many possible ways of making the exploration policies differ, we experiment with using ε-greedy exploration with ε periodically sampled from some distribution by each thread.
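As an illustration, a small sketch of per-thread ε-greedy action selection; the sampled ε values and their probabilities below are placeholders, not the settings used in the experiments.

import numpy as np

def sample_epsilon(rng):
    # Each thread periodically draws its own exploration rate; the values
    # and probabilities here are illustrative placeholders only.
    return rng.choice([0.5, 0.1, 0.01], p=[0.3, 0.4, 0.3])

def epsilon_greedy(q_values, epsilon, rng):
    # With probability epsilon take a uniformly random action,
    # otherwise act greedily with respect to the current Q estimates.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))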
Asynchronous one-step Sarsa: The asynchronous one-step Sarsa algorithm is the same as asynchronous one-step Q-learning as given in Algorithm 1 except that it uses a different target value for Q(s, a). The target value used by one-step Sarsa is r + γQ(s′, a′; θ⁻), where a′ is the action taken in state s′ (Rummery & Niranjan, 1994; Sutton & Barto, 1998). We again use a target network and updates accumulated over multiple timesteps to stabilize learning.
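For comparison, a minimal sketch of the two one-step targets, where q_next is assumed to hold the target network's estimates Q(s′, ·; θ⁻):

import numpy as np

def q_learning_target(r, q_next, gamma, terminal):
    # One-step Q-learning bootstraps with the maximum target-network value.
    return r if terminal else r + gamma * np.max(q_next)

def sarsa_target(r, q_next, a_next, gamma, terminal):
    # One-step Sarsa instead bootstraps with the value of the action a'
    # actually taken in s': r + gamma * Q(s', a'; theta_minus).
    return r if terminal else r + gamma * q_next[a_next]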
Asynchronous n-step Q-learning: Pseudocode for our variant of multi-step Q-learning is shown in Supplementary Algorithm S2. The algorithm is somewhat unusual because it operates in the forward view by explicitly computing n-step returns, as opposed to the more common backward view used by techniques like eligibility traces (Sutton & Barto, 1998). We found that using the forward view is easier when training neural networks with momentum-based methods and backpropagation through time. In order to compute a single update, the algorithm first selects actions using its exploration policy for up to t_max steps or until a terminal state is reached. This process results in the agent receiving up to t_max rewards from the environment since its last update. The algorithm then computes gradients for n-step Q-learning updates for each of the state-action pairs encountered since the last update. Each n-step update uses the longest possible n-step return, resulting in a one-step update for the last state, a two-step update for the second last state, and so on for a total of up to t_max updates. The accumulated updates are applied in a single gradient step.
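A short sketch of how these forward-view targets can be computed with a single backward pass over the collected rewards; bootstrap_value stands in for the target network's value estimate at the last state reached (zero at a terminal state):

def n_step_targets(rewards, bootstrap_value, gamma):
    # `rewards` are the up-to-t_max rewards collected since the last update.
    R = bootstrap_value
    targets = []
    for r in reversed(rewards):
        R = r + gamma * R          # extend the return by one more step
        targets.append(R)
    targets.reverse()              # targets[i] is the return for the i-th stored state
    return targets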
Asynchronous advantage actor-critic: The algorithm, which we call asynchronous advantage actor-critic (A3C), maintains a policy π(a_t|s_t; θ) and an estimate of the value function V(s_t; θ_v). Like our variant of n-step Q-learning, our variant of actor-critic also operates in the forward view and uses the same mix of n-step returns to update both the policy and the value function. The policy and the value function are updated after every t_max actions or when a terminal state is reached. The update performed by the algorithm can be seen as ∇_{θ′} log π(a_t|s_t; θ′) A(s_t, a_t; θ, θ_v), where A(s_t, a_t; θ, θ_v) is an estimate of the advantage function given by ∑_{i=0}^{k−1} γ^i r_{t+i} + γ^k V(s_{t+k}; θ_v) − V(s_t; θ_v), where k can vary from state to state and is upper-bounded by t_max. The pseudocode for the algorithm is presented in Supplementary Algorithm S3.
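As an illustrative sketch (not the pseudocode of Algorithm S3), the n-step returns and advantage estimates for one segment can be computed as follows; the returns serve as regression targets for V(s_t; θ_v) and the advantages scale ∇_{θ′} log π(a_t|s_t; θ′):

def a3c_returns_and_advantages(rewards, values, bootstrap_value, gamma):
    # `values` holds V(s; theta_v) for each stored state; `bootstrap_value`
    # is V(s_{t+k}; theta_v), or 0 if a terminal state ended the segment.
    R = bootstrap_value
    returns, advantages = [], []
    for r, v in zip(reversed(rewards), reversed(values)):
        R = r + gamma * R
        returns.append(R)
        advantages.append(R - v)   # estimate of A(s, a; theta, theta_v)
    return list(reversed(returns)), list(reversed(advantages))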
As with the value-based methods we rely on parallel actor-learners and accumulated updates for improving training stability. Note that while the parameters θ of the policy and θ_v of the value function are shown as being separate for generality, we always share some of the parameters in practice. We typically use a convolutional neural network that has one softmax output for the policy π(a_t|s_t; θ) and one linear output for the value function V(s_t; θ_v), with all non-output layers shared.
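A minimal sketch of such a two-headed output on top of a shared feature trunk; the weight and bias names are hypothetical:

import numpy as np

def softmax(x):
    z = x - np.max(x)              # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

def policy_value_heads(features, W_pi, b_pi, W_v, b_v):
    # `features` is the output of the shared (e.g. convolutional) trunk.
    # One softmax head produces pi(a|s; theta) and one linear head
    # produces the scalar value estimate V(s; theta_v).
    policy = softmax(W_pi @ features + b_pi)
    value = float(W_v @ features + b_v)
    return policy, value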
We also found that adding the entropy of the policy π to the objective function improved exploration by discouraging premature convergence to suboptimal deterministic policies. This technique was originally proposed by Williams & Peng (1991), who found that it was particularly helpful on tasks requiring hierarchical behavior. The gradient of the full objective function including the entropy regularization term with respect to the policy parameters takes the form ∇_{θ′} log π(a_t|s_t; θ′)(R_t − V(s_t; θ_v)) + β∇_{θ′} H(π(s_t; θ′)), where H is the entropy. The hyperparameter β controls the strength of the entropy regularization term.
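For illustration, the entropy term contributed to the objective can be computed as below; the small eps constant is added here only for numerical stability and is not part of the formula:

import numpy as np

def entropy_bonus(policy_probs, beta, eps=1e-8):
    # H(pi(s; theta)) = -sum_a pi(a|s) log pi(a|s); adding beta * H to the
    # objective discourages premature convergence to a deterministic policy.
    entropy = -np.sum(policy_probs * np.log(policy_probs + eps))
    return beta * entropy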
Optimization: We investigated three different optimization algorithms in our asynchronous framework – SGD with momentum, RMSProp (Tieleman & Hinton, 2012) without shared statistics, and RMSProp with shared statistics. We used the standard non-centered RMSProp update given by

g = αg + (1 − α)∆θ²  and  θ ← θ − η∆θ/√(g + ε),    (1)

where all operations are performed elementwise. A comparison on a subset of Atari 2600 games showed that a variant of RMSProp where statistics g are shared across threads is considerably more robust than the other two methods. Full details of the methods and comparisons are included in Supplementary Section 7.
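A minimal sketch of the shared-statistics variant of the update in Eq. (1); the default values of alpha and eps below are placeholders rather than the tuned settings:

import numpy as np

def shared_rmsprop_step(theta, g, delta_theta, eta, alpha=0.99, eps=1e-8):
    # Non-centered RMSProp: in the shared variant, `g` is a single array of
    # second-moment statistics used by all threads, while `delta_theta` is
    # one thread's accumulated gradient. Updates are done in place.
    g[:] = alpha * g + (1.0 - alpha) * delta_theta ** 2
    theta[:] = theta - eta * delta_theta / np.sqrt(g + eps)
    return theta, g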
5. Experiments
We use four different platforms for assessing the properties of the proposed framework. We perform most of our experiments using the Arcade Learning Environment (Bellemare et al., 2012), which provides a simulator for Atari 2600 games. This is one of the most commonly used benchmark environments for RL algorithms. We use the Atari domain to compare against state-of-the-art results (Van Hasselt et al., 2015; Wang et al., 2015; Schaul et al., 2015; Nair et al., 2015; Mnih et al., 2015), as well as to carry out a detailed stability and scalability analysis of the proposed methods. We performed further comparisons using the TORCS 3D car racing simulator (Wymann et al., 2013). We also use