ing minibatches. This reduces the chances of multiple actor-learners overwriting each other's updates. Accumulating updates over several steps also provides some ability to trade off computational efficiency for data efficiency.
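As an illustration only, a minimal NumPy sketch of accumulating per-step gradients into a single update; compute_gradient and apply_update are hypothetical placeholders for an actor-learner's gradient computation and its asynchronous parameter update, not functions from the actual implementation.

import numpy as np

def accumulate_and_apply(params, transitions, compute_gradient, apply_update):
    # Sum the per-step gradients instead of applying each one immediately,
    # analogous to using a minibatch.
    accumulated = np.zeros_like(params)
    for transition in transitions:
        accumulated += compute_gradient(params, transition)
    # A single update for the whole batch of steps reduces the chance that
    # parallel actor-learners overwrite each other's changes.
    return apply_update(params, accumulated)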
Finally, we found that giving each thread a different exploration policy helps improve robustness. Adding diversity to exploration in this manner also generally improves performance through better exploration. While there are many possible ways of making the exploration policies differ, we experiment with using ε-greedy exploration with ε periodically sampled from some distribution by each thread.
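As an illustration, a small sketch of per-thread ε-greedy action selection; the sampled ε values and their probabilities below are placeholders, not the settings used in the experiments.

import numpy as np

def sample_epsilon(rng):
    # Each thread periodically draws its own exploration rate; the values
    # and probabilities here are illustrative placeholders only.
    return rng.choice([0.5, 0.1, 0.01], p=[0.3, 0.4, 0.3])

def epsilon_greedy(q_values, epsilon, rng):
    # With probability epsilon take a uniformly random action,
    # otherwise act greedily with respect to the current Q estimates.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))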
Asynchronous one-step Sarsa: The asynchronous one-step Sarsa algorithm is the same as asynchronous one-step Q-learning as given in Algorithm 1 except that it uses a different target value for Q(s, a). The target value used by one-step Sarsa is r + γQ(s′, a′; θ⁻), where a′ is the action taken in state s′ (Rummery & Niranjan, 1994; Sutton & Barto, 1998). We again use a target network and updates accumulated over multiple timesteps to stabilize learning.
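For comparison, a minimal sketch of the two one-step targets, where q_next is assumed to hold the target network's estimates Q(s′, ·; θ⁻):

import numpy as np

def q_learning_target(r, q_next, gamma, terminal):
    # One-step Q-learning bootstraps with the maximum target-network value.
    return r if terminal else r + gamma * np.max(q_next)

def sarsa_target(r, q_next, a_next, gamma, terminal):
    # One-step Sarsa instead bootstraps with the value of the action a'
    # actually taken in s': r + gamma * Q(s', a'; theta_minus).
    return r if terminal else r + gamma * q_next[a_next]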
Asynchronous n-step Q-learning: Pseudocode for our variant of multi-step Q-learning is shown in Supplementary Algorithm S2. The algorithm is somewhat unusual because it operates in the forward view by explicitly computing n-step returns, as opposed to the more common backward view used by techniques like eligibility traces (Sutton & Barto, 1998). We found that using the forward view is easier when training neural networks with momentum-based methods and backpropagation through time. In order to compute a single update, the algorithm first selects actions using its exploration policy for up to t_max steps or until a terminal state is reached. This process results in the agent receiving up to t_max rewards from the environment since its last update. The algorithm then computes gradients for n-step Q-learning updates for each of the state-action pairs encountered since the last update. Each n-step update uses the longest possible n-step return, resulting in a one-step update for the last state, a two-step update for the second last state, and so on for a total of up to t_max updates. The accumulated updates are applied in a single gradient step.
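A short sketch of how these forward-view targets can be computed with a single backward pass over the collected rewards; bootstrap_value stands in for the target network's value estimate at the last state reached (zero at a terminal state):

def n_step_targets(rewards, bootstrap_value, gamma):
    # `rewards` are the up-to-t_max rewards collected since the last update.
    R = bootstrap_value
    targets = []
    for r in reversed(rewards):
        R = r + gamma * R          # extend the return by one more step
        targets.append(R)
    targets.reverse()              # targets[i] is the return for the i-th stored state
    return targets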
Asynchronous advantage actor-critic: The algorithm, which we call asynchronous advantage actor-critic (A3C), maintains a policy π(a_t|s_t; θ) and an estimate of the value function V(s_t; θ_v). Like our variant of n-step Q-learning, our variant of actor-critic also operates in the forward view and uses the same mix of n-step returns to update both the policy and the value function. The policy and the value function are updated after every t_max actions or when a terminal state is reached. The update performed by the algorithm can be seen as ∇_{θ′} log π(a_t|s_t; θ′) A(s_t, a_t; θ, θ_v), where A(s_t, a_t; θ, θ_v) is an estimate of the advantage function given by ∑_{i=0}^{k−1} γ^i r_{t+i} + γ^k V(s_{t+k}; θ_v) − V(s_t; θ_v), where k can vary from state to state and is upper-bounded by t_max. The pseudocode for the algorithm is presented in Supplementary Algorithm S3.
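As an illustrative sketch (not the pseudocode of Algorithm S3), the n-step returns and advantage estimates for one segment can be computed as follows; the returns serve as regression targets for V(s_t; θ_v) and the advantages scale ∇_{θ′} log π(a_t|s_t; θ′):

def a3c_returns_and_advantages(rewards, values, bootstrap_value, gamma):
    # `values` holds V(s; theta_v) for each stored state; `bootstrap_value`
    # is V(s_{t+k}; theta_v), or 0 if a terminal state ended the segment.
    R = bootstrap_value
    returns, advantages = [], []
    for r, v in zip(reversed(rewards), reversed(values)):
        R = r + gamma * R
        returns.append(R)
        advantages.append(R - v)   # estimate of A(s, a; theta, theta_v)
    return list(reversed(returns)), list(reversed(advantages))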
As with the value-based methods we rely on parallel actor-learners and accumulated updates for improving training stability. Note that while the parameters θ of the policy and θ_v of the value function are shown as being separate for generality, we always share some of the parameters in practice. We typically use a convolutional neural network that has one softmax output for the policy π(a_t|s_t; θ) and one linear output for the value function V(s_t; θ_v), with all non-output layers shared.
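A minimal sketch of such a two-headed output on top of a shared feature trunk; the weight and bias names are hypothetical:

import numpy as np

def softmax(x):
    z = x - np.max(x)              # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

def policy_value_heads(features, W_pi, b_pi, W_v, b_v):
    # `features` is the output of the shared (e.g. convolutional) trunk.
    # One softmax head produces pi(a|s; theta) and one linear head
    # produces the scalar value estimate V(s; theta_v).
    policy = softmax(W_pi @ features + b_pi)
    value = float(W_v @ features + b_v)
    return policy, value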
We also found that adding the entropy of the policy π to the objective function improved exploration by discouraging premature convergence to suboptimal deterministic policies. This technique was originally proposed by Williams & Peng (1991), who found that it was particularly helpful on tasks requiring hierarchical behavior. The gradient of the full objective function including the entropy regularization term with respect to the policy parameters takes the form ∇_{θ′} log π(a_t|s_t; θ′)(R_t − V(s_t; θ_v)) + β∇_{θ′} H(π(s_t; θ′)), where H is the entropy. The hyperparameter β controls the strength of the entropy regularization term.
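For illustration, the entropy term contributed to the objective can be computed as below; the small eps constant is added here only for numerical stability and is not part of the formula:

import numpy as np

def entropy_bonus(policy_probs, beta, eps=1e-8):
    # H(pi(s; theta)) = -sum_a pi(a|s) log pi(a|s); adding beta * H to the
    # objective discourages premature convergence to a deterministic policy.
    entropy = -np.sum(policy_probs * np.log(policy_probs + eps))
    return beta * entropy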
Optimization: We investigated three different optimization algorithms in our asynchronous framework – SGD with momentum, RMSProp (Tieleman & Hinton, 2012) without shared statistics, and RMSProp with shared statistics. We used the standard non-centered RMSProp update given by

g = αg + (1 − α)∆θ²  and  θ ← θ − η∆θ/√(g + ε),    (1)

where all operations are performed elementwise. A comparison on a subset of Atari 2600 games showed that a variant of RMSProp where statistics g are shared across threads is considerably more robust than the other two methods. Full details of the methods and comparisons are included in Supplementary Section 7.
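A minimal sketch of the shared-statistics variant of the update in Eq. (1); the default values of alpha and eps below are placeholders rather than the tuned settings:

import numpy as np

def shared_rmsprop_step(theta, g, delta_theta, eta, alpha=0.99, eps=1e-8):
    # Non-centered RMSProp: in the shared variant, `g` is a single array of
    # second-moment statistics used by all threads, while `delta_theta` is
    # one thread's accumulated gradient. Updates are done in place.
    g[:] = alpha * g + (1.0 - alpha) * delta_theta ** 2
    theta[:] = theta - eta * delta_theta / np.sqrt(g + eps)
    return theta, g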
5. Experiments
We use four different platforms for assessing the properties of the proposed framework. We perform most of our experiments using the Arcade Learning Environment (Bellemare et al., 2012), which provides a simulator for Atari 2600 games. This is one of the most commonly used benchmark environments for RL algorithms. We use the Atari domain to compare against state-of-the-art results (Van Hasselt et al., 2015; Wang et al., 2015; Schaul et al., 2015; Nair et al., 2015; Mnih et al., 2015), as well as to carry out a detailed stability and scalability analysis of the proposed methods. We performed further comparisons using the TORCS 3D car racing simulator (Wymann et al., 2013). We also use