DiCE: The Infinitely Differentiable Monte Carlo Estimator
Jakob Foerster 1, Gregory Farquhar *1, Maruan Al-Shedivat *2, Tim Rocktäschel 1, Eric P. Xing 2, Shimon Whiteson 1
Abstract
The score function estimator is widely used for estimating gradients of stochastic objectives in Stochastic Computation Graphs (SCGs), e.g., in reinforcement learning and meta-learning. While deriving the first order gradient estimators by differentiating a surrogate loss (SL) objective is computationally and conceptually simple, using the same approach for higher order gradients is more challenging. Firstly, analytically deriving and implementing such estimators is laborious and not compliant with automatic differentiation. Secondly, repeatedly applying SL to construct new objectives for each order of gradient involves increasingly cumbersome graph manipulations. Lastly, to match the first order gradient under differentiation, SL treats part of the cost as a fixed sample, which we show leads to missing and wrong terms for higher order gradient estimators. To address all these shortcomings in a unified way, we introduce DiCE, which provides a single objective that can be differentiated repeatedly, generating correct gradient estimators of any order in SCGs. Unlike SL, DiCE relies on automatic differentiation for performing the requisite graph manipulations. We verify the correctness of DiCE both through a proof and through numerical evaluation of the DiCE gradient estimates. We also use DiCE to propose and evaluate a novel approach for multi-agent learning. Our code is available at https://goo.gl/xkkGxN.
1. Introduction
The score function trick is used to produce Monte Carlo estimates of gradients in settings with non-differentiable objectives, e.g., in meta-learning and reinforcement learning. Estimating the first order gradients is computationally and
* Order determined by a dice roll. 1 University of Oxford, 2 Carnegie Mellon University. Correspondence to: Jakob Foerster <jakob.foerster@cs.ox.ac.uk>.
conceptually simple. While the gradient estimators can be directly defined, it is often more convenient to define an objective whose derivative is the gradient estimator, and to let the powerful automatic differentiation (auto-diff) toolboxes implemented in deep learning libraries do the work. This is the method used by the surrogate loss (SL) approach (Schulman et al., 2015a), which provides a recipe for building a surrogate objective from a stochastic computation graph (SCG). When differentiated, the SL produces an estimator for the first order gradient of the original objective.
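As a concrete sketch (our own illustrative example, not code from the paper), the following NumPy snippet checks the first order score function estimator on a Gaussian sampling distribution, where the true gradient is known in closed form. The product f(x) * dlogp is exactly what auto-diff returns when differentiating the surrogate loss log p(x; theta) * f(x) with the cost f held constant:

```python
import numpy as np

# Illustrative example: x ~ N(theta, 1) and cost f(x) = x^2, so
# E[f(x)] = theta^2 + 1 and the true gradient is d/dtheta E[f(x)] = 2 * theta.
rng = np.random.default_rng(0)
theta = 1.5
n = 1_000_000

x = rng.normal(theta, 1.0, size=n)
f = x ** 2                # cost, treated as a fixed sample by the SL
dlogp = x - theta         # d/dtheta log N(x; theta, 1)

# First order score function estimator: E[f(x) * d/dtheta log p(x; theta)]
grad_est = np.mean(f * dlogp)
print(grad_est)           # approximately 2 * theta = 3.0
```

In a deep learning library, the same estimate would come from differentiating the surrogate objective directly; the manual dlogp term above just makes the mechanics explicit.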
However, estimating higher order gradients is more challenging. Such estimators are useful for a number of optimization techniques, e.g., for accelerating convergence in supervised settings (Dennis & Moré, 1977). Furthermore, they are vital for gradient-based meta-learning (Finn et al., 2017; Al-Shedivat et al., 2017; Li et al., 2017), which differentiates an objective after some number of first order learning steps. Higher order gradient estimators have also proven useful in multi-agent learning (Foerster et al., 2018), where one agent differentiates through the learning process of another agent.
Unfortunately, the first order gradient estimators mentioned above are fundamentally ill-suited for calculating higher order derivatives via auto-diff. Due to the dependency on the sampling distribution, higher order gradient estimators require repeated application of the score function trick. Simply differentiating the first order estimator again, as was for example done by Finn et al. (2017), leads to missing terms.
To obtain higher order score function gradient estimators, there are currently two unsatisfactory options. The first is to analytically derive and implement the estimators. However, this is laborious, error prone, and does not comply with the auto-diff paradigm. The second is to repeatedly apply the SL approach to construct new objectives for each further gradient estimate. However, constructing each of these new objectives involves increasingly complex graph manipulations, defeating the appeal of using a differentiable surrogate loss.
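To make the missing terms concrete, here is a small numerical sketch (again our own example, not taken from the paper): for x ~ N(theta, 1) and f(x) = x^2, the true second derivative of E[f(x)] with respect to theta is 2. Applying the score function trick twice recovers it, while naively re-differentiating the first order estimator with the sample held fixed does not:

```python
import numpy as np

# Illustrative example: x ~ N(theta, 1), f(x) = x^2, so E[f(x)] = theta^2 + 1
# and the true second derivative w.r.t. theta is 2.
rng = np.random.default_rng(0)
theta = 1.5
n = 1_000_000

x = rng.normal(theta, 1.0, size=n)
f = x ** 2
dlogp = x - theta     # d/dtheta   log N(x; theta, 1)
d2logp = -1.0         # d^2/dtheta^2 log N(x; theta, 1)

# Correct second order estimator, from applying the score function trick
# twice: E[f * ((dlogp)^2 + d2logp)]
correct = np.mean(f * (dlogp ** 2 + d2logp))   # approximately 2.0

# Naive approach: differentiate the first order estimator f * (x - theta)
# again while holding the sample x (and hence f) fixed, which yields -f.
naive = np.mean(-f)   # approximately -(theta^2 + 1) = -3.25, not 2.0
print(correct, naive)
```

The naive estimate is not just noisy but biased: the terms arising from the dependence of the sampling distribution on theta are dropped entirely, which is the failure mode described above.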
Moreover, to match the first order gradient after a single differentiation, the SL treats part of the cost as a fixed sample, severing the dependency on the parameters. We show that this yields missing and incorrect terms in higher order gradient estimators. We believe that these difficulties have
arXiv:1802.05098v1 [cs.LG] 14 Feb 2018