expressive models themselves require substantially more
data, and that otherwise efficient algorithms like Dyna-Q
are vulnerable to poor model approximations.
3. Background
In reinforcement learning, the goal is to learn a policy to
control a system with states x ∈ X and actions u ∈ U
in environment E, so as to maximize the expected sum of
returns according to a reward function r(x, u). The dy-
namical system is defined by an initial state distribution
p(x_1) and a dynamics distribution p(x_{t+1}|x_t, u_t). At each
time step t ∈ [1, T], the agent chooses an action u_t according to its current policy π(u_t|x_t), and observes a reward r(x_t, u_t). The agent then experiences a transition to a
new state sampled from the dynamics distribution, and we
can express the resulting state visitation frequency of the
policy π as ρ^π(x_t). Define R_t = Σ_{i=t}^{T} γ^{(i−t)} r(x_i, u_i); the goal is to maximize the expected sum of returns, given by R = E_{r_{i≥1}, x_{i≥1}∼E, u_{i≥1}∼π}[R_1], where γ is a discount
factor that prioritizes earlier rewards over later ones. With
γ < 1, we can also set T = ∞, though we use a finite hori-
zon for all of the tasks in our experiments. The expected re-
turn R can be optimized using a variety of model-free and
model-based algorithms. In this section, we review several
of these methods that we build on in our work.
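As a concrete illustration of the discounted return R_t defined above, the following sketch computes it for a sampled trajectory with a backward recursion; the array and function names are ours, chosen only for illustration.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Compute R_t = sum_{i=t}^{T} gamma^(i-t) * r_i for every step t of a
    sampled trajectory, using the recursion R_t = r_t + gamma * R_{t+1}."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a three-step trajectory of rewards with gamma = 0.99.
print(discounted_returns(np.array([1.0, 0.0, 2.0]), 0.99))  # [2.9602 1.98 2.]
```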
Model-Free Reinforcement Learning. When the sys-
tem dynamics p(x_{t+1}|x_t, u_t) are not known, as is often
the case with physical systems such as robots, policy gra-
dient methods (Peters & Schaal, 2006) and value function
or Q-function learning with function approximation (Sut-
ton et al., 1999) are often preferred. Policy gradient meth-
ods provide a simple, direct approach to RL, which can
succeed on high-dimensional problems, but potentially re-
quires a large number of samples (Schulman et al., 2015;
2016). Off-policy algorithms that use value or Q-function
approximation can in principle achieve better data effi-
ciency (Lillicrap et al., 2016). However, adapting such
methods to continuous tasks typically requires optimizing
two function approximators on different objectives. We in-
stead build on standard Q-learning, which has a single ob-
jective. We summarize Q-learning in this section. The Q
function Q^π(x_t, u_t) corresponding to a policy π is defined as the expected return from x_t after taking action u_t and following the policy π thereafter:
Q^π(x_t, u_t) = E_{r_{i≥t}, x_{i>t}∼E, u_{i>t}∼π}[R_t | x_t, u_t]    (1)
Q-learning learns a greedy deterministic policy
µ(x_t) = arg max_u Q(x_t, u), which corresponds to π(u_t|x_t) = δ(u_t = µ(x_t)). Let θ^Q parametrize the action-value function and β be an arbitrary exploration policy; the learning objective is to minimize the Bellman error, where we fix the target y_t:
L(θ^Q) = E_{x_t∼ρ^β, u_t∼β, r_t∼E}[(Q(x_t, u_t|θ^Q) − y_t)^2]
y_t = r(x_t, u_t) + γQ(x_{t+1}, µ(x_{t+1}))    (2)
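To make the objective in Equation (2) concrete, here is a minimal tabular sketch (our own toy example, with a small discrete state and action space) of a single update that regresses Q(x_t, u_t) toward the fixed target y_t:

```python
import numpy as np

n_states, n_actions, gamma, alpha = 5, 3, 0.99, 0.1
Q = np.zeros((n_states, n_actions))  # tabular stand-in for Q(x, u | theta^Q)

def q_learning_step(x, u, r, x_next):
    """One gradient step on (Q(x, u) - y)^2 with the target
    y = r + gamma * max_u' Q(x_next, u') held fixed."""
    y = r + gamma * Q[x_next].max()          # fixed target y_t
    Q[x, u] -= alpha * 2.0 * (Q[x, u] - y)   # gradient of the squared error

# Example transition (x_t, u_t, r_t, x_{t+1}) gathered by an exploration policy.
q_learning_step(0, 1, 1.0, 2)
```

The inner maximization over actions is trivial here only because the action set is discrete.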
For continuous action problems, Q-learning becomes diffi-
cult, because it requires maximizing a complex, nonlinear
function at each update. For this reason, continuous do-
mains are often tackled using actor-critic methods (Konda
& Tsitsiklis, 1999; Hafner & Riedmiller, 2011; Silver et al.,
2014; Lillicrap et al., 2016), where a separate parame-
terized “actor” policy π is learned in addition to the Q-
function or value function “critic,” such as Deep Determin-
istic Policy Gradient (DDPG) algorithm (Lillicrap et al.,
2016).
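The sketch below illustrates, with our own placeholder architectures and dimensions (and omitting DDPG's target networks and replay buffer), how such actor-critic methods end up optimizing two function approximators on two different objectives:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 8, 2, 0.99
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def actor_critic_update(x, u, r, x_next):
    # Critic objective: Bellman error as in Equation (2), with the argmax
    # over actions replaced by the actor's action at the next state.
    with torch.no_grad():
        y = r + gamma * critic(torch.cat([x_next, actor(x_next)], dim=-1))
    critic_loss = ((critic(torch.cat([x, u], dim=-1)) - y) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor objective: make the critic's value of the actor's own actions as
    # large as possible -- a separate objective from the critic's.
    actor_loss = -critic(torch.cat([x, actor(x)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```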
In order to describe our method in the following sections, it
will be useful to also define the value function V^π(x_t) and advantage function A^π(x_t, u_t) of a given policy π:
V^π(x_t) = E_{r_{i≥t}, x_{i>t}∼E, u_{i≥t}∼π}[R_t | x_t]
A^π(x_t, u_t) = Q^π(x_t, u_t) − V^π(x_t).    (3)
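For intuition, the relationship in Equation (3) can be written out directly in a small discrete example (the array shapes and the uniform stand-in policy below are ours):

```python
import numpy as np

n_states, n_actions = 4, 3
Q_pi = np.random.randn(n_states, n_actions)           # stand-in for Q^pi(x, u)
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # stand-in policy pi(u|x)

# V^pi(x) is the expected Q-value under the policy's own action choice, and
# A^pi(x, u) measures how much better u is than the policy's average action.
V_pi = (pi * Q_pi).sum(axis=1, keepdims=True)  # V^pi(x) = E_{u~pi}[Q^pi(x, u)]
A_pi = Q_pi - V_pi                             # A^pi(x, u) = Q^pi(x, u) - V^pi(x)
```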
Model-Based Reinforcement Learning. If we know the
dynamics p(x_{t+1}|x_t, u_t), or if we can approximate them with some learned model p̂(x_{t+1}|x_t, u_t), we can use
model-based RL and optimal control. While a wide range
of model-based RL and control methods have been pro-
posed in the literature (Deisenroth et al., 2013; Kober &
Peters, 2012), two are particularly relevant for this work:
iterative LQG (iLQG) (Li & Todorov, 2004) and Dyna-
Q (Sutton, 1990). The iLQG algorithm optimizes tra-
jectories by iteratively constructing locally optimal lin-
ear feedback controllers under a local linearization of the
dynamics p̂(x_{t+1}|x_t, u_t) = N(f_{xt} x_t + f_{ut} u_t, F_t) and a quadratic expansion of the rewards r(x_t, u_t) (Tassa et al., 2012). Under linear dynamics and quadratic rewards, the action-value function Q(x_t, u_t) and value function V(x_t) are locally quadratic and can be computed by dynamic programming. The optimal policy can be derived
analytically from the quadratic Q(x_t, u_t) and V(x_t) functions, and corresponds to a linear feedback controller g(x_t) = û_t + k_t + K_t(x_t − x̂_t), where k_t is an open-loop term, K_t is the closed-loop feedback matrix, and x̂_t and û_t are the states and actions of the nominal trajectory, which is the average trajectory of the controller. Employing the maximum entropy objective (Levine & Koltun, 2013), we can also construct a linear-Gaussian controller, where c is a scalar to adjust for arbitrary scaling of the reward magnitudes:
π_t^{iLQG}(u_t|x_t) = N(û_t + k_t + K_t(x_t − x̂_t), −cQ_{u,u,t}^{−1})    (4)
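As a sketch of how the controller in Equation (4) is used at run time, the function below samples an action given quantities that an iLQG backward pass is assumed to have already produced (the nominal trajectory x̂_t, û_t, the gains k_t, K_t, and the Hessian Q_{u,u,t}); the function and argument names are ours:

```python
import numpy as np

def ilqg_action(x_t, x_hat_t, u_hat_t, k_t, K_t, Q_uu_t, c=1.0, rng=None):
    """Sample u_t ~ N(u_hat_t + k_t + K_t (x_t - x_hat_t), -c * inv(Q_uu_t)).
    Assumes Q_uu_t (the Hessian of Q with respect to u at time t) is negative
    definite, so that -c * inv(Q_uu_t) is a valid covariance matrix."""
    rng = np.random.default_rng() if rng is None else rng
    mean = u_hat_t + k_t + K_t @ (x_t - x_hat_t)
    cov = -c * np.linalg.inv(Q_uu_t)
    return rng.multivariate_normal(mean, cov)
```

Taking c → 0 shrinks the covariance to zero and recovers the deterministic linear feedback controller g(x_t) described above.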
When the dynamics are not known, a particularly effective
way to use iLQG is to combine it with learned time-varying