3.3 ANALYSIS ON NON-CONVERGENCE OF ADAM
As we observed in the previous section, a common characteristic of these counterexamples is that the net update factor for a gradient with large magnitude is smaller than that for a gradient with small magnitude. This observation can be interpreted as a direct consequence of the inappropriate correlation between $v_t$ and $g_t$. Recall that $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$. Assuming $v_{t-1}$ is independent of $g_t$, then when a new gradient $g_t$ arrives, if $g_t$ is large, $v_t$ is likely to be large; and if $g_t$ is small, $v_t$ is likely to be small. If $\beta_1 = 0$, then $k(g_t) = \alpha_t / \sqrt{v_t}$. As a result, in Adam a large gradient is likely to have a small net update factor, while a small gradient is likely to have a large net update factor.
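To make the correlation concrete, here is a minimal numerical sketch (the gradient sequence and hyperparameters are illustrative assumptions, not values from the paper):

```python
# Minimal sketch: with beta1 = 0, the net update factor k(g_t) = alpha_t / sqrt(v_t)
# is small exactly when g_t is large, since v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2.
import numpy as np

alpha, beta2 = 0.1, 0.9
v = 0.0
for t, g in enumerate([0.1, 0.1, 10.0, 0.1, 0.1], start=1):
    v = beta2 * v + (1 - beta2) * g ** 2
    k = alpha / np.sqrt(v)  # net update factor k(g_t) for beta1 = 0
    print(f"t={t}  g_t={g:5.1f}  v_t={v:8.4f}  k(g_t)={k:.4f}")
# The step with the large gradient (g_t = 10.0) receives a net update factor
# roughly two orders of magnitude smaller than the steps with g_t = 0.1.
```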
When it comes to the scenario where $\beta_1 > 0$, the argument is quite similar. Given $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ and assuming $v_{t-1}$ and $\{g_{t+i}\}_{i=1}^{\infty}$ are independent of $g_t$, not only does $v_t$ positively correlate with the magnitude of $g_t$, but the entire infinite sequence $\{v_i\}_{i=t}^{\infty}$ also positively correlates with the magnitude of $g_t$. Since the net update factor $k(g_t) = \sum_{i=t}^{\infty} \frac{\alpha_i}{\sqrt{v_i}} (1-\beta_1)\beta_1^{\,i-t}$ negatively correlates with each $v_i$ in $\{v_i\}_{i=t}^{\infty}$, it is thus negatively correlated with the magnitude of $g_t$. That is, $k(g_t)$ for a large gradient is likely to be small, while $k(g_t)$ for a small gradient is likely to be large.
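The same effect can be checked empirically for $\beta_1 > 0$ by truncating the infinite sum. The following sketch (with assumed hyperparameters and an assumed periodic gradient sequence) estimates $k(g_t)$ at a large and at a small gradient:

```python
# Hedged sketch: truncated estimate of the net update factor
# k(g_t) = sum_{i>=t} (alpha_i / sqrt(v_i)) * (1 - beta1) * beta1^(i - t),
# with constant alpha_i = alpha and a periodic gradient sequence.
import numpy as np

alpha, beta1, beta2 = 0.1, 0.9, 0.9

def net_update_factor(grads, t, horizon=1000):
    """Estimate k(g_t), truncating the sum once beta1^(i - t) is negligible."""
    v, k = 0.0, 0.0
    for i, g in enumerate(grads[:t + horizon], start=1):
        v = beta2 * v + (1 - beta2) * g ** 2
        if i >= t:
            k += (alpha / np.sqrt(v)) * (1 - beta1) * beta1 ** (i - t)
    return k

# One large gradient every tenth step, small gradients otherwise.
grads = [10.0 if i % 10 == 0 else 0.1 for i in range(2500)]
print("k at a large gradient:", net_update_factor(grads, t=1001))  # smaller
print("k at a small gradient:", net_update_factor(grads, t=1006))  # larger
```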
The unbalanced net update factors cause the non-convergence problem of Adam, as well as of all other adaptive learning rate methods where $v_t$ correlates with $g_t$. Counterexamples can be constructed with the same pattern: the large gradient points in the "correct" direction, while the small gradients point in the opposite direction. Because the accumulated influence of the large gradient is small while the accumulated influence of the small gradients is large, Adam may update the parameters in the wrong direction.
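The following is a minimal simulation of this pattern, in the style of the online counterexample of Reddi et al. (2018); the constants are illustrative assumptions. The minimum of the repeating objective over $\theta \in [-1, 1]$ is at $\theta = -1$, yet Adam drifts to $+1$:

```python
# Sketch: large gradient +C every third step (the "correct" descent direction),
# small gradients -1 otherwise; Adam with beta1 = 0 drifts the wrong way.
import numpy as np

C = 4.0
alpha = 0.1
beta2 = 1.0 / (1.0 + C ** 2)  # small beta2, as in Reddi et al.'s construction
theta, v = 0.0, 0.0

for t in range(1, 10001):
    g = C if t % 3 == 1 else -1.0             # gradient of f_t at theta
    v = beta2 * v + (1 - beta2) * g ** 2      # second-moment estimate
    theta -= alpha * g / (np.sqrt(v) + 1e-8)  # beta1 = 0, so m_t = g_t
    theta = min(max(theta, -1.0), 1.0)        # project back onto [-1, 1]

print("final theta:", theta)  # ends at +1, the wrong direction
```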
Finally, we would like to emphasize that even when Adam updates the parameters in the right direction overall, the unbalanced net update factors remain unfavorable, since they slow down convergence.
4 THE PROPOSED METHOD: DECORRELATION VIA TEMPORAL SHIFTING
According to the previous discussion, we conclude that the main cause of the non-convergence of Adam is the inappropriate correlation between $v_t$ and $g_t$. There are currently two possible remedies: (1) making $v_t$ act like a constant, which weakens the correlation, e.g., by using a large $\beta_2$ or by keeping $v_t$ non-decreasing (Reddi et al., 2018); and (2) using a large $\beta_1$ (Theorem 1), where the aggressive momentum term helps mitigate the impact of the unbalanced net update factors. However, neither solves the problem fundamentally.
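For concreteness, here is a minimal sketch of remedy (1) in the form of the AMSGrad rule of Reddi et al. (2018), written by us for a scalar parameter (bias correction omitted, variable names assumed):

```python
import numpy as np

def amsgrad_step(theta, m, v, v_hat, g,
                 alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update: like Adam, but the denominator is non-decreasing."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = max(v_hat, v)  # running maximum keeps the denominator non-decreasing
    theta = theta - alpha * m / (np.sqrt(v_hat) + eps)
    return theta, m, v, v_hat
```

Compared with Adam, only the running maximum and the use of `v_hat` in the denominator change; this weakens the correlation between the current step size and $g_t$, but does not remove it.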
The dilemma caused by $v_t$ forces us to rethink its role. In adaptive learning rate methods, $v_t$ estimates the second moment of the gradients, which reflects their scale on average. With the adaptive learning rate $\alpha_t / \sqrt{v_t}$, the update step of $g_t$ is scaled down by $\sqrt{v_t}$, achieving rescaling invariance with respect to the scale of $g_t$; this is practically useful for making the training process easy to control and the training system robust. However, the current scheme, $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$, introduces a positive correlation between $v_t$ and $g_t$, which reduces the effect of large gradients, amplifies the effect of small gradients, and ultimately causes the non-convergence problem. The key, therefore, is to let $v_t$ be a quantity that reflects the scale of the gradients while being decorrelated from the current gradient $g_t$. Formally, we have the following theorem:
Theorem 5 (Decorrelation leads to convergence). For any fixed online optimization problem with infinite repetition of a finite set of cost functions $\{f_1(\theta), \dots, f_t(\theta), \dots, f_n(\theta)\}$, assuming $\beta_1 = 0$ and $\alpha_t$ is fixed, if $v_t$ follows a fixed distribution and is independent of the current gradient $g_t$, then the expected net update factor for each gradient is identical.
Let $P_v$ denote the distribution of $v_t$. In the infinitely repeating online optimization scheme, the expectation of the net update factor for each gradient $g_t$ is
$$\mathbb{E}[k(g_t)] = \sum_{i=t}^{\infty} \mathbb{E}_{v_i \sim P_v}\!\left[\frac{\alpha_i}{\sqrt{v_i}}(1-\beta_1)\beta_1^{\,i-t}\right]. \tag{13}$$
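Since $\alpha_i$ is fixed and each $v_i$ follows the same distribution $P_v$ independently of $g_t$, the expectation factors out of the sum and the geometric weights sum to one (our paraphrase of the step the theorem relies on):
$$\mathbb{E}[k(g_t)] = \alpha\,\mathbb{E}_{v \sim P_v}\!\left[\frac{1}{\sqrt{v}}\right]\sum_{i=t}^{\infty}(1-\beta_1)\beta_1^{\,i-t} = \alpha\,\mathbb{E}_{v \sim P_v}\!\left[\frac{1}{\sqrt{v}}\right],$$
which does not depend on $g_t$; hence the expected net update factor is identical for every gradient.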