otherwise. The first case only happens in the most severe case of sparsity: when a gradient has been zero at all timesteps except at the current timestep. For less sparse cases, the effective stepsize will be smaller. When $(1 - \beta_1) = \sqrt{1 - \beta_2}$ we have that $|\hat{m}_t/\sqrt{\hat{v}_t}| < 1$, therefore $|\Delta_t| < \alpha$. In more common scenarios, we will have that $\hat{m}_t/\sqrt{\hat{v}_t} \approx \pm 1$ since $|E[g]/\sqrt{E[g^2]}| \leq 1$. The effective magnitude of the steps taken in parameter space at each timestep is approximately bounded by the stepsize setting $\alpha$, i.e., $|\Delta_t| \lessapprox \alpha$. This can be understood as establishing a trust region around the current parameter value, beyond which the current gradient estimate does not provide sufficient information. This typically makes it relatively easy to know the right scale of $\alpha$ in advance. For many machine learning models, for instance, we often know in advance that good optima are with high probability within some set region in parameter space; it is not uncommon, for example, to have a prior distribution over the parameters. Since $\alpha$ sets (an upper bound of) the magnitude of steps in parameter space, we can often deduce the right order of magnitude of $\alpha$ such that optima can be reached from $\theta_0$ within some number of iterations. With a slight abuse of terminology, we will call the ratio $\hat{m}_t/\sqrt{\hat{v}_t}$ the signal-to-noise ratio (SNR). With a smaller SNR the effective stepsize $\Delta_t$ will be closer to zero. This is a desirable property, since a smaller SNR means that there is greater uncertainty about whether the direction of $\hat{m}_t$ corresponds to the direction of the true gradient. For example, the SNR value typically becomes closer to 0 towards an optimum, leading to smaller effective steps in parameter space: a form of automatic annealing. The effective stepsize $\Delta_t$ is also invariant to the scale of the gradients; rescaling the gradients $g$ with factor $c$ will scale $\hat{m}_t$ with a factor $c$ and $\hat{v}_t$ with a factor $c^2$, which cancel out: $(c \cdot \hat{m}_t)/(\sqrt{c^2 \cdot \hat{v}_t}) = \hat{m}_t/\sqrt{\hat{v}_t}$.
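As a numerical illustration of these two properties (this sketch is not part of the paper; the synthetic gradient distribution, the rescaling factor $c$, and the hyperparameter values are arbitrary choices, and $\epsilon$ is omitted to isolate the ratio), the following Python snippet checks that the update $\Delta_t = \alpha \cdot \hat{m}_t/\sqrt{\hat{v}_t}$ is unchanged when all gradients are rescaled by $c$, and prints the largest realized $|\Delta_t|/\alpha$, which stays close to 1:

```python
import numpy as np

def adam_updates(grads, alpha=0.001, beta1=0.9, beta2=0.999):
    """Bias-corrected updates Delta_t = alpha * m_hat_t / sqrt(v_hat_t) (epsilon omitted)."""
    m = v = 0.0
    out = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # first moment estimate
        v = beta2 * v + (1 - beta2) * g ** 2     # second raw moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias-corrected estimates
        v_hat = v / (1 - beta2 ** t)
        out.append(alpha * m_hat / np.sqrt(v_hat))
    return np.array(out)

rng = np.random.default_rng(0)
g = rng.normal(0.5, 1.0, size=1000)              # synthetic noisy gradients of one parameter
c = 37.0                                         # arbitrary rescaling factor

# Scale invariance: (c * m_hat_t) / sqrt(c^2 * v_hat_t) = m_hat_t / sqrt(v_hat_t)
print(np.allclose(adam_updates(g), adam_updates(c * g)))   # True
# Effective steps are approximately bounded by alpha
print(np.max(np.abs(adam_updates(g))) / 0.001)             # close to 1
```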
3 INITIALIZATION BIAS CORRECTION
As explained in section 2, Adam utilizes initialization bias correction terms. We will here derive the term for the second moment estimate; the derivation for the first moment estimate is completely analogous. Let $g$ be the gradient of the stochastic objective $f$, and we wish to estimate its second raw moment (uncentered variance) using an exponential moving average of the squared gradient, with decay rate $\beta_2$. Let $g_1, \dots, g_T$ be the gradients at subsequent timesteps, each a draw from an underlying gradient distribution $g_t \sim p(g_t)$. Let us initialize the exponential moving average as $v_0 = 0$ (a vector of zeros). First note that the update at timestep $t$ of the exponential moving average $v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (where $g_t^2$ indicates the elementwise square $g_t \odot g_t$) can be written as a function of the gradients at all previous timesteps:

$$v_t = (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} \cdot g_i^2 \qquad (1)$$
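As a quick sanity check of eq. (1), the sketch below (not part of the paper; the decay rate and the synthetic gradient stream are arbitrary) compares the recursion $v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ against the closed-form weighted sum:

```python
import numpy as np

beta2 = 0.999
rng = np.random.default_rng(1)
g = rng.normal(size=200)             # synthetic gradients g_1, ..., g_T

# Recursive exponential moving average, initialized at v_0 = 0
v_rec = 0.0
for g_t in g:
    v_rec = beta2 * v_rec + (1 - beta2) * g_t ** 2

# Closed form of eq. (1): v_t = (1 - beta2) * sum_i beta2^(t-i) * g_i^2
t = len(g)
i = np.arange(1, t + 1)
v_closed = (1 - beta2) * np.sum(beta2 ** (t - i) * g ** 2)

print(np.isclose(v_rec, v_closed))   # True
```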
We wish to know how $E[v_t]$, the expected value of the exponential moving average at timestep $t$, relates to the true second moment $E[g_t^2]$, so we can correct for the discrepancy between the two. Taking expectations of the left-hand and right-hand sides of eq. (1):

$$
\begin{aligned}
E[v_t] &= E\Big[(1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} \cdot g_i^2\Big] && (2)\\
&= E[g_t^2] \cdot (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} + \zeta && (3)\\
&= E[g_t^2] \cdot (1 - \beta_2^t) + \zeta && (4)
\end{aligned}
$$
where $\zeta = 0$ if the true second moment $E[g_i^2]$ is stationary; otherwise $\zeta$ can be kept small, since the exponential decay rate $\beta_2$ can (and should) be chosen such that the exponential moving average assigns small weights to gradients too far in the past. What is left is the term $(1 - \beta_2^t)$, which is caused by initializing the running average with zeros. In algorithm 1 we therefore divide by this term to correct the initialization bias.
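The stationary case ($\zeta = 0$) of eq. (4) can be checked by simulation. The sketch below is not from the paper; the Gaussian gradient distribution, $\beta_2$, $t$, and the number of runs are arbitrary choices. It averages $v_t$ over many independent gradient streams, compares the result with $E[g^2] \cdot (1 - \beta_2^t)$, and shows that dividing by $(1 - \beta_2^t)$ recovers $E[g^2]$:

```python
import numpy as np

beta2, t, runs = 0.999, 10, 200_000
mu, sigma = 0.5, 1.0                       # stationary gradient distribution N(mu, sigma^2)
true_second_moment = mu ** 2 + sigma ** 2  # E[g^2]

rng = np.random.default_rng(2)
g = rng.normal(mu, sigma, size=(runs, t))  # independent gradient streams

# Run the EMA v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2 for each stream
v = np.zeros(runs)
for step in range(t):
    v = beta2 * v + (1 - beta2) * g[:, step] ** 2

print(v.mean())                            # ~ E[g^2] * (1 - beta2^t): strongly biased towards zero
print(true_second_moment * (1 - beta2 ** t))  # the biased value predicted by eq. (4)
print((v / (1 - beta2 ** t)).mean())       # ~ E[g^2] after dividing by the correction term
```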
In case of sparse gradients, a reliable estimate of the second moment requires averaging over many gradients, i.e. choosing a small value of $1 - \beta_2$ ($\beta_2$ close to 1); however, it is exactly in this case that a lack of initialization bias correction would lead to initial steps that are much larger.
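To make the last point concrete, the following sketch (not from the paper; the hyperparameters and the gradient value are illustrative) compares the magnitude of the very first update with and without bias correction when $\beta_2$ is close to 1: without correction the first step has magnitude $\alpha (1 - \beta_1)/\sqrt{1 - \beta_2} \approx 3.2\,\alpha$ for these values, and the gap grows as $\beta_2 \to 1$.

```python
import numpy as np

alpha, beta1, beta2 = 0.001, 0.9, 0.999   # illustrative hyperparameter values
g1 = 0.01                                 # the first observed gradient (value is arbitrary)

m1 = (1 - beta1) * g1                     # first-moment EMA after one step (m_0 = 0)
v1 = (1 - beta2) * g1 ** 2                # second-moment EMA after one step (v_0 = 0)

step_uncorrected = alpha * m1 / np.sqrt(v1)
step_corrected = alpha * (m1 / (1 - beta1)) / np.sqrt(v1 / (1 - beta2))

print(abs(step_uncorrected) / alpha)      # ~3.16: (1 - beta1) / sqrt(1 - beta2)
print(abs(step_corrected) / alpha)        # 1.0:  m_hat_1 / sqrt(v_hat_1) = sign(g_1)
```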