Here, the subscript of $\mathbb{E}$ enumerates the variables being integrated over, where states and actions are sampled sequentially from the dynamics model $P(s_{t+1} \mid s_t, a_t)$ and policy $\pi(a_t \mid s_t)$, respectively. The colon notation $a:b$ refers to the inclusive range $(a, a+1, \dots, b)$. These formulas are well known and straightforward to obtain; they follow directly from Proposition 1, which will be stated shortly.
The choice $\Psi_t = A^{\pi}(s_t, a_t)$ yields almost the lowest possible variance, though in practice the advantage function is not known and must be estimated. This statement can be intuitively justified by the following interpretation of the policy gradient: a step in the policy gradient direction should increase the probability of better-than-average actions and decrease the probability of worse-than-average actions. The advantage function, by its definition $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$, measures whether the action is better or worse than the policy's default behavior. Hence, we should choose $\Psi_t$ to be the advantage function $A^{\pi}(s_t, a_t)$, so that the gradient term $\Psi_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ points in the direction of increased $\pi_\theta(a_t \mid s_t)$ if and only if $A^{\pi}(s_t, a_t) > 0$. See Greensmith et al. (2004) for a more rigorous analysis of the variance of policy gradient estimators and the effect of using a baseline.
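To make this sign argument concrete, the following minimal sketch (an illustration, not part of the paper) takes a single step along $\hat{A}_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ for a toy softmax policy over three actions at one state; the logits parameterization, step size, and advantage values are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

theta = np.zeros(3)            # logits of a toy softmax policy (hypothetical)
action = 1                     # the sampled action a_t
lr = 0.5                       # illustrative step size

for advantage in (+1.0, -1.0): # better- vs. worse-than-average action
    pi = softmax(theta)
    # For softmax logits, grad_theta log pi(action) = one_hot(action) - pi.
    grad_log_pi = -pi.copy()
    grad_log_pi[action] += 1.0
    new_pi = softmax(theta + lr * advantage * grad_log_pi)
    print(f"A_hat={advantage:+.1f}: pi(a_t) {pi[action]:.3f} -> {new_pi[action]:.3f}")
```

The probability of $a_t$ increases exactly when the advantage is positive, matching the interpretation above.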
We will introduce a parameter $\gamma$ that allows us to reduce variance by downweighting rewards corresponding to delayed effects, at the cost of introducing bias. This parameter corresponds to the discount factor used in discounted formulations of MDPs, but we treat it as a variance reduction parameter in an undiscounted problem; this technique was analyzed theoretically by Marbach & Tsitsiklis (2003); Kakade (2001b); Thomas (2014). The discounted value functions are given by:
$$V^{\pi,\gamma}(s_t) := \mathbb{E}_{s_{t+1:\infty},\, a_{t:\infty}}\left[\sum_{l=0}^{\infty} \gamma^l r_{t+l}\right] \qquad Q^{\pi,\gamma}(s_t, a_t) := \mathbb{E}_{s_{t+1:\infty},\, a_{t+1:\infty}}\left[\sum_{l=0}^{\infty} \gamma^l r_{t+l}\right] \tag{4}$$
$$A^{\pi,\gamma}(s_t, a_t) := Q^{\pi,\gamma}(s_t, a_t) - V^{\pi,\gamma}(s_t). \tag{5}$$
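As a rough numerical companion to these definitions (a sketch under illustrative assumptions, not the paper's method), the snippet below computes empirical discounted returns for a recorded finite-length trajectory and subtracts a stand-in value estimate to form advantage estimates; the truncation to a finite horizon, the `rewards` array, and the `values` array are all hypothetical.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """R[t] = sum_{l >= 0} gamma^l * rewards[t + l] for a finite trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical trajectory data: per-step rewards and a stand-in for
# V^{pi,gamma}(s_t), e.g. the output of a learned value function.
rewards = np.array([0.0, 1.0, 0.0, 2.0])
values  = np.array([0.8, 1.5, 1.0, 1.9])
gamma   = 0.99

q_hat = discounted_returns(rewards, gamma)  # Monte Carlo estimate of Q^{pi,gamma}(s_t, a_t)
a_hat = q_hat - values                      # estimate of A^{pi,gamma}(s_t, a_t), as in Eq. (5)
```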
The discounted approximation to the policy gradient is defined as follows:
$$g^{\gamma} := \mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[\sum_{t=0}^{\infty} A^{\pi,\gamma}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]. \tag{6}$$
The following section discusses how to obtain biased (but not too biased) estimators for $A^{\pi,\gamma}$, giving us noisy estimates of the discounted policy gradient in Equation (6).
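For concreteness, one way such a noisy, sample-based estimate $\sum_t \hat{A}_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ might be assembled for a single trajectory is sketched below using PyTorch; this is an assumption-laden illustration rather than the paper's implementation, where `policy_logits` is a hypothetical module mapping states to action logits and `advantages` would come from an advantage estimator such as the one sketched after Equation (5).

```python
import torch

def policy_gradient_estimate(policy_logits, states, actions, advantages):
    """Sample-based estimate of Eq. (6) for a single trajectory."""
    logits = policy_logits(states)                      # shape (T, num_actions)
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)                  # log pi_theta(a_t | s_t)
    # Advantages are treated as fixed weights on the score function.
    surrogate = (advantages.detach() * log_probs).sum()
    return torch.autograd.grad(surrogate, list(policy_logits.parameters()))
```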
Before proceeding, we will introduce the notion of a γ-just estimator of the advantage function, which is an estimator that does not introduce bias when we use it in place of $A^{\pi,\gamma}$ (which is not known and must be estimated) in Equation (6) to estimate $g^{\gamma}$.¹ Consider an advantage estimator $\hat{A}_t(s_{0:\infty}, a_{0:\infty})$, which may in general be a function of the entire trajectory.
Definition 1. The estimator $\hat{A}_t$ is γ-just if
$$\mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[\hat{A}_t(s_{0:\infty}, a_{0:\infty})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = \mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[A^{\pi,\gamma}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]. \tag{7}$$
It follows immediately that if $\hat{A}_t$ is γ-just for all $t$, then
$$\mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[\sum_{t=0}^{\infty} \hat{A}_t(s_{0:\infty}, a_{0:\infty})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = g^{\gamma}. \tag{8}$$
One sufficient condition for $\hat{A}_t$ to be γ-just is that $\hat{A}_t$ decomposes as the difference between two functions $Q_t$ and $b_t$, where $Q_t$ can depend on any trajectory variables but gives an unbiased estimator of the γ-discounted Q-function, and $b_t$ is an arbitrary function of the states and actions sampled before $a_t$.
Proposition 1. Suppose that $\hat{A}_t$ can be written in the form $\hat{A}_t(s_{0:\infty}, a_{0:\infty}) = Q_t(s_{t:\infty}, a_{t:\infty}) - b_t(s_{0:t}, a_{0:t-1})$ such that for all $(s_t, a_t)$, $\mathbb{E}_{s_{t+1:\infty},\, a_{t+1:\infty} \mid s_t, a_t}\left[Q_t(s_{t:\infty}, a_{t:\infty})\right] = Q^{\pi,\gamma}(s_t, a_t)$. Then $\hat{A}_t$ is γ-just.
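As one concrete instance of this decomposition (an illustration consistent with the definition of $Q^{\pi,\gamma}$ in Equation (4), not an exhaustive list), take the empirical discounted return from time $t$ together with a baseline that depends only on $s_t$:
$$\hat{A}_t = \underbrace{\sum_{l=0}^{\infty} \gamma^l r_{t+l}}_{Q_t(s_{t:\infty},\, a_{t:\infty})} - \underbrace{V(s_t)}_{b_t(s_{0:t},\, a_{0:t-1})}, \qquad \mathbb{E}_{s_{t+1:\infty},\, a_{t+1:\infty} \mid s_t, a_t}\left[\sum_{l=0}^{\infty} \gamma^l r_{t+l}\right] = Q^{\pi,\gamma}(s_t, a_t),$$
so this estimator is γ-just by Proposition 1, for any function $V$ of the current state.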
¹Note that we have already introduced bias by using $A^{\pi,\gamma}$ in place of $A^{\pi}$; here we are concerned with obtaining an unbiased estimate of $g^{\gamma}$, which is a biased estimate of the policy gradient of the undiscounted MDP.