3.3 ANALYSIS ON NON-CONVERGENCE OF ADAM
As we observed in the previous section, a common characteristic of these counterexamples is that the net update factor for a gradient with large magnitude is smaller than that for a gradient with small magnitude. This observation can be interpreted as a direct consequence of the inappropriate correlation between $v_t$ and $g_t$. Recall that $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$. Assuming $v_{t-1}$ is independent of $g_t$, then when a new gradient $g_t$ arrives, if $g_t$ is large, $v_t$ is likely to be large; and if $g_t$ is small, $v_t$ is likely to be small. If $\beta_1 = 0$, then $k(g_t) = \alpha_t / \sqrt{v_t}$. As a result, in Adam a large gradient is likely to have a small net update factor, while a small gradient is likely to have a large net update factor.
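To make the correlation concrete, here is a minimal numerical sketch (the gradient sequence and hyperparameters are illustrative assumptions, not values from the paper):

```python
# Minimal sketch: with beta1 = 0, the net update factor k(g_t) = alpha_t / sqrt(v_t)
# is small exactly when g_t is large, since v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2.
import numpy as np

alpha, beta2 = 0.1, 0.9
v = 0.0
for t, g in enumerate([0.1, 0.1, 10.0, 0.1, 0.1], start=1):
    v = beta2 * v + (1 - beta2) * g ** 2
    k = alpha / np.sqrt(v)  # net update factor k(g_t) for beta1 = 0
    print(f"t={t}  g_t={g:5.1f}  v_t={v:8.4f}  k(g_t)={k:.4f}")
# The step with the large gradient (g_t = 10.0) receives a net update factor
# roughly two orders of magnitude smaller than the steps with g_t = 0.1.
```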
When it comes to the scenario where $\beta_1 > 0$, the argument is quite similar. Given $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ and assuming $v_{t-1}$ and $\{g_{t+i}\}_{i=1}^{\infty}$ are independent of $g_t$, not only does $v_t$ positively correlate with the magnitude of $g_t$, but the entire infinite sequence $\{v_i\}_{i=t}^{\infty}$ also positively correlates with the magnitude of $g_t$. Since the net update factor $k(g_t) = \sum_{i=t}^{\infty} \frac{\alpha_i}{\sqrt{v_i}} (1-\beta_1)\beta_1^{\,i-t}$ negatively correlates with each $v_i$ in $\{v_i\}_{i=t}^{\infty}$, it is thus negatively correlated with the magnitude of $g_t$. That is, $k(g_t)$ for a large gradient is likely to be small, while $k(g_t)$ for a small gradient is likely to be large.
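The same effect can be checked empirically for $\beta_1 > 0$ by truncating the infinite sum. The following sketch (with assumed hyperparameters and an assumed periodic gradient sequence) estimates $k(g_t)$ at a large and at a small gradient:

```python
# Hedged sketch: truncated estimate of the net update factor
# k(g_t) = sum_{i>=t} (alpha_i / sqrt(v_i)) * (1 - beta1) * beta1^(i - t),
# with constant alpha_i = alpha and a periodic gradient sequence.
import numpy as np

alpha, beta1, beta2 = 0.1, 0.9, 0.9

def net_update_factor(grads, t, horizon=1000):
    """Estimate k(g_t), truncating the sum once beta1^(i - t) is negligible."""
    v, k = 0.0, 0.0
    for i, g in enumerate(grads[:t + horizon], start=1):
        v = beta2 * v + (1 - beta2) * g ** 2
        if i >= t:
            k += (alpha / np.sqrt(v)) * (1 - beta1) * beta1 ** (i - t)
    return k

# One large gradient every tenth step, small gradients otherwise.
grads = [10.0 if i % 10 == 0 else 0.1 for i in range(2500)]
print("k at a large gradient:", net_update_factor(grads, t=1001))  # smaller
print("k at a small gradient:", net_update_factor(grads, t=1006))  # larger
```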
The unbalanced net update factors cause the non-convergence problem of Adam, as well as of all other adaptive learning rate methods where $v_t$ correlates with $g_t$. Counterexamples can be constructed with the same pattern: the large gradient points in the "correct" direction, while the small gradients point in the opposite direction. Because the accumulated influence of the large gradient is small while the accumulated influence of the small gradients is large, Adam may update the parameters in the wrong direction.
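The following is a minimal simulation of this pattern, in the style of the online counterexample of Reddi et al. (2018); the constants are illustrative assumptions. The minimum of the repeating objective over $\theta \in [-1, 1]$ is at $\theta = -1$, yet Adam drifts to $+1$:

```python
# Sketch: large gradient +C every third step (the "correct" descent direction),
# small gradients -1 otherwise; Adam with beta1 = 0 drifts the wrong way.
import numpy as np

C = 4.0
alpha = 0.1
beta2 = 1.0 / (1.0 + C ** 2)  # small beta2, as in Reddi et al.'s construction
theta, v = 0.0, 0.0

for t in range(1, 10001):
    g = C if t % 3 == 1 else -1.0             # gradient of f_t at theta
    v = beta2 * v + (1 - beta2) * g ** 2      # second-moment estimate
    theta -= alpha * g / (np.sqrt(v) + 1e-8)  # beta1 = 0, so m_t = g_t
    theta = min(max(theta, -1.0), 1.0)        # project back onto [-1, 1]

print("final theta:", theta)  # ends at +1, the wrong direction
```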
Finally, we would like to emphasize that even when Adam updates the parameters in the right direction overall, the unbalanced net update factors remain unfavorable, since they slow down convergence.
4 THE PROPOSED METHOD: DECORRELATION VIA TEMPORAL SHIFTING
According to the previous discussion, we conclude that the main cause of the non-convergence of Adam is the inappropriate correlation between $v_t$ and $g_t$. There are currently two possible remedies: (1) making $v_t$ act like a constant, which weakens the correlation, e.g., by using a large $\beta_2$ or by keeping $v_t$ non-decreasing (Reddi et al., 2018); and (2) using a large $\beta_1$ (Theorem 1), where the aggressive momentum term helps mitigate the impact of the unbalanced net update factors. However, neither solves the problem fundamentally.
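For concreteness, here is a minimal sketch of remedy (1) in the form of the AMSGrad rule of Reddi et al. (2018), written by us for a scalar parameter (bias correction omitted, variable names assumed):

```python
import numpy as np

def amsgrad_step(theta, m, v, v_hat, g,
                 alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update: like Adam, but the denominator is non-decreasing."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = max(v_hat, v)  # running maximum keeps the denominator non-decreasing
    theta = theta - alpha * m / (np.sqrt(v_hat) + eps)
    return theta, m, v, v_hat
```

Compared with Adam, only the running maximum and the use of `v_hat` in the denominator change; this weakens the correlation between the current step size and $g_t$, but does not remove it.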
The dilemma caused by $v_t$ forces us to rethink its role. In adaptive learning rate methods, $v_t$ estimates the second moment of the gradients, which reflects their scale on average. With the adaptive learning rate $\alpha_t / \sqrt{v_t}$, the update step of $g_t$ is scaled down by $\sqrt{v_t}$, achieving rescaling invariance with respect to the scale of $g_t$; this is practically useful for making the training process easy to control and the training system robust. However, the current scheme, $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$, introduces a positive correlation between $v_t$ and $g_t$, which reduces the effect of large gradients, amplifies the effect of small gradients, and ultimately causes the non-convergence problem. The key, therefore, is to let $v_t$ be a quantity that reflects the scale of the gradients while being decorrelated from the current gradient $g_t$. Formally, we have the following theorem:
Theorem 5 (Decorrelation leads to convergence). For any fixed online optimization problem with infinite repetition of a finite set of cost functions $\{f_1(\theta), \dots, f_t(\theta), \dots, f_n(\theta)\}$, assuming $\beta_1 = 0$ and $\alpha_t$ is fixed, if $v_t$ follows a fixed distribution and is independent of the current gradient $g_t$, then the expected net update factor for each gradient is identical.
Let $P_v$ denote the distribution of $v_t$. In the infinitely repeating online optimization scheme, the expectation of the net update factor for each gradient $g_t$ is
$$\mathbb{E}[k(g_t)] = \sum_{i=t}^{\infty} \mathbb{E}_{v_i \sim P_v}\!\left[\frac{\alpha_i}{\sqrt{v_i}}(1-\beta_1)\beta_1^{\,i-t}\right]. \tag{13}$$
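Since $\alpha_i$ is fixed and each $v_i$ follows the same distribution $P_v$ independently of $g_t$, the expectation factors out of the sum and the geometric weights sum to one (our paraphrase of the step the theorem relies on):
$$\mathbb{E}[k(g_t)] = \alpha\,\mathbb{E}_{v \sim P_v}\!\left[\frac{1}{\sqrt{v}}\right]\sum_{i=t}^{\infty}(1-\beta_1)\beta_1^{\,i-t} = \alpha\,\mathbb{E}_{v \sim P_v}\!\left[\frac{1}{\sqrt{v}}\right],$$
which does not depend on $g_t$; hence the expected net update factor is identical for every gradient.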