DiCE: The Infinitely Differentiable Monte Carlo Estimator
Jakob Foerster 1, Gregory Farquhar *1, Maruan Al-Shedivat *2, Tim Rocktäschel 1, Eric P. Xing 2, Shimon Whiteson 1
Abstract
The score function estimator is widely used for estimating gradients of stochastic objectives in Stochastic Computation Graphs (SCGs), e.g., in reinforcement learning and meta-learning. While deriving the first order gradient estimators by differentiating a surrogate loss (SL) objective is computationally and conceptually simple, using the same approach for higher order gradients is more challenging. Firstly, analytically deriving and implementing such estimators is laborious and not compliant with automatic differentiation. Secondly, repeatedly applying SL to construct new objectives for each order of gradient involves increasingly cumbersome graph manipulations. Lastly, to match the first order gradient under differentiation, SL treats part of the cost as a fixed sample, which we show leads to missing and wrong terms for higher order gradient estimators. To address all these shortcomings in a unified way, we introduce DiCE, which provides a single objective that can be differentiated repeatedly, generating correct gradient estimators of any order in SCGs. Unlike SL, DiCE relies on automatic differentiation for performing the requisite graph manipulations. We verify the correctness of DiCE both through a proof and through numerical evaluation of the DiCE gradient estimates. We also use DiCE to propose and evaluate a novel approach for multi-agent learning. Our code is available at https://goo.gl/xkkGxN.
1. Introduction
The score function trick is used to produce Monte Carlo estimates of gradients in settings with non-differentiable objectives, e.g., in meta-learning and reinforcement learning. Estimating the first order gradients is computationally and
* Order determined by a dice roll. 1 University of Oxford, 2 Carnegie Mellon University. Correspondence to: Jakob Foerster <jakob.foerster@cs.ox.ac.uk>.
conceptually simple. While the gradient estimators can be directly defined, it is often more convenient to define an objective whose derivative is the gradient estimator, and to let the powerful automatic differentiation (auto-diff) toolboxes implemented in deep learning libraries do the work. This is the method used by the surrogate loss (SL) approach (Schulman et al., 2015a), which provides a recipe for building a surrogate objective from a stochastic computation graph (SCG). When differentiated, the SL produces an estimator for the first order gradient of the original objective.
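As a concrete sketch (our own illustrative example, not code from the paper), the following NumPy snippet checks the first order score function estimator on a Gaussian sampling distribution, where the true gradient is known in closed form. The product f(x) * dlogp is exactly what auto-diff returns when differentiating the surrogate loss log p(x; theta) * f(x) with the cost f held constant:

```python
import numpy as np

# Illustrative example: x ~ N(theta, 1) and cost f(x) = x^2, so
# E[f(x)] = theta^2 + 1 and the true gradient is d/dtheta E[f(x)] = 2 * theta.
rng = np.random.default_rng(0)
theta = 1.5
n = 1_000_000

x = rng.normal(theta, 1.0, size=n)
f = x ** 2                # cost, treated as a fixed sample by the SL
dlogp = x - theta         # d/dtheta log N(x; theta, 1)

# First order score function estimator: E[f(x) * d/dtheta log p(x; theta)]
grad_est = np.mean(f * dlogp)
print(grad_est)           # approximately 2 * theta = 3.0
```

In a deep learning library, the same estimate would come from differentiating the surrogate objective directly; the manual dlogp term above just makes the mechanics explicit.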
However, estimating higher order gradients is more challenging. Such estimators are useful for a number of optimization techniques, e.g., for accelerating convergence in supervised settings (Dennis & Moré, 1977). Furthermore, they are vital for gradient-based meta-learning (Finn et al., 2017; Al-Shedivat et al., 2017; Li et al., 2017), which differentiates an objective after some number of first order learning steps. Higher order gradient estimators have also proven useful in multi-agent learning (Foerster et al., 2018), where one agent differentiates through the learning process of another agent.
Unfortunately, the first order gradient estimators mentioned above are fundamentally ill-suited for calculating higher order derivatives via auto-diff. Due to the dependency on the sampling distribution, higher order gradient estimators require repeated application of the score function trick. Simply differentiating the first order estimator again, as was for example done by Finn et al. (2017), leads to missing terms.
To obtain higher order score function gradient estimators, there are currently two unsatisfactory options. The first is to analytically derive and implement the estimators. However, this is laborious, error prone, and does not comply with the auto-diff paradigm. The second is to repeatedly apply the SL approach to construct new objectives for each further gradient estimate. However, constructing each of these new objectives involves increasingly complex graph manipulations, defeating the appeal of using a differentiable surrogate loss.
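To make the missing terms concrete, here is a small numerical sketch (again our own example, not taken from the paper): for x ~ N(theta, 1) and f(x) = x^2, the true second derivative of E[f(x)] with respect to theta is 2. Applying the score function trick twice recovers it, while naively re-differentiating the first order estimator with the sample held fixed does not:

```python
import numpy as np

# Illustrative example: x ~ N(theta, 1), f(x) = x^2, so E[f(x)] = theta^2 + 1
# and the true second derivative w.r.t. theta is 2.
rng = np.random.default_rng(0)
theta = 1.5
n = 1_000_000

x = rng.normal(theta, 1.0, size=n)
f = x ** 2
dlogp = x - theta     # d/dtheta   log N(x; theta, 1)
d2logp = -1.0         # d^2/dtheta^2 log N(x; theta, 1)

# Correct second order estimator, from applying the score function trick
# twice: E[f * ((dlogp)^2 + d2logp)]
correct = np.mean(f * (dlogp ** 2 + d2logp))   # approximately 2.0

# Naive approach: differentiate the first order estimator f * (x - theta)
# again while holding the sample x (and hence f) fixed, which yields -f.
naive = np.mean(-f)   # approximately -(theta^2 + 1) = -3.25, not 2.0
print(correct, naive)
```

The naive estimate is not just noisy but biased: the terms arising from the dependence of the sampling distribution on theta are dropped entirely, which is the failure mode described above.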
Moreover, to match the first order gradient after a single differentiation, the SL treats part of the cost as a fixed sample, severing the dependency on the parameters. We show that this yields missing and incorrect terms in higher order gradient estimators. We believe that these difficulties have
arXiv:1802.05098v1 [cs.LG] 14 Feb 2018