Deep Learning Optimization: Disentangling the Interaction Between Adaptive Gradient Methods and Learning Rates


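"从学习速率中解开自适应梯度法（Disentangling Adaptive Gradient）.pdf" This paper examines a key problem in evaluating deep learning optimization algorithms: the interaction between adaptive gradient methods and learning rate tuning. The learning rate is a critically important hyperparameter in neural network training, with a significant effect on both convergence speed and generalization. The authors introduce an experimental method called "grafting", which separates the magnitude of an update (its step size) from its direction and thereby exposes details that earlier studies may have overlooked.

"Disentangling Adaptive Gradient Methods from Learning Rates" is the core theme of the study, which sets out to analyze how adaptive gradient algorithms behave independently of the learning rate. Adaptive gradient methods such as Adagrad, RMSprop, and Adam optimize a model by dynamically adjusting a per-parameter learning rate, and they perform well when parameters have different scales. Their internal mechanics, however, are often conflated with learning rate tuning, which leads to misconceptions about these methods. By separating update magnitude from update direction, the grafting experiments suggest that many existing beliefs about adaptive gradient methods may stem from insufficient isolation of the learning rate schedule. This experimental design lets researchers assess the effect of each factor on its own and so understand optimizer performance in more depth.

The paper also reviews the generalization ability of adaptive gradient methods from both empirical and theoretical angles. Generalization measures how a model performs on unseen data, which is essential in practice. The authors aim to give a clear account of why, and under what conditions, these optimization methods generalize well.

Through this work, the authors hope to give the deep learning community stronger tools and insights for understanding and tuning these algorithms, improving both training efficiency and generalization. This could help optimize existing neural network architectures and may also inspire new optimization strategies.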


of M’s step and direction of D’s step.

Algorithm 2 AdaGraft meta-optimizer

1: Input: Optimizers M, D; initializer $w_1$; $\varepsilon > 0$.

2: Initialize M, D at $w_1$.

3: for $t = 1, \ldots, T$ do

4: Receive stochastic gradient $g_t$ at $w_t$.

5: Query steps from M and D: $w_M := M(w_t, g_t)$, $w_D := D(w_t, g_t)$.

6: Update with grafted step: $w_{t+1} \leftarrow w_t + \dfrac{\|w_M - w_t\|}{\|w_D - w_t\| + \varepsilon} \cdot (w_D - w_t)$.

7: end for
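To make the grafted update concrete, here is a minimal pure-Python sketch of the idea in Algorithm 2: take the step length from M and the step direction from D. It is an illustration, not the authors' implementation. The child optimizers `gd_step` and `sign_step` are hypothetical stateless stand-ins (the optimizers used in the paper carry state such as momentum), and the default value of `eps` is an assumption.

```python
import math


def norm(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))


def graft_step(w, w_m, w_d, eps=1e-16):
    """Combine the step length of M with the step direction of D.

    w   : current iterate w_t
    w_m : iterate proposed by M, i.e. M(w_t, g_t)
    w_d : iterate proposed by D, i.e. D(w_t, g_t)
    eps : small positive constant (the value here is an assumption)
    """
    step_m = [a - b for a, b in zip(w_m, w)]
    step_d = [a - b for a, b in zip(w_d, w)]
    scale = norm(step_m) / (norm(step_d) + eps)
    return [wi + scale * di for wi, di in zip(w, step_d)]


# Hypothetical stateless child optimizers, for illustration only.
def gd_step(w, g, lr=0.5):
    """Plain gradient descent; plays the role of M (magnitude)."""
    return [wi - lr * gi for wi, gi in zip(w, g)]


def sign_step(w, g, lr=0.01):
    """Sign descent; plays the role of D (direction). Zero counts as positive."""
    return [wi - lr * math.copysign(1.0, gi) for wi, gi in zip(w, g)]


def adagraft(w, grad_fn, m_opt, d_opt, num_steps):
    """Run the grafted update for num_steps iterations."""
    for _ in range(num_steps):
        g = grad_fn(w)
        w = graft_step(w, m_opt(w, g), d_opt(w, g))
    return w
```

For example, with `w = [1.0, -2.0]` and gradient `[3.0, -4.0]`, one call to `adagraft(w, grad_fn, gd_step, sign_step, 1)` moves `w` by a step of length 2.5 (the length of the `gd_step` update) along the direction proposed by `sign_step`.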

Layer-wise vs. global grafting. Two natural variants of Algorithm 2 come to mind, especially if one is concerned about efficient implementation (see Appendix C.1). In the layer-wise version, we view $w_t$ as a single parameter group (usually a tensor-shaped variable specified by the architecture), and apply AdaGraft and its child optimizers to each group. In the global version, $w_t$ contains all of the model's weights. We discuss and evaluate both variants, but our main experimental results use layer-wise grafting.
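The difference between the two variants comes down to where the norms are taken. The sketch below assumes parameters are stored as a dictionary mapping a group name to a flat list of floats; that layout, the function names, and the default `eps` are illustrative assumptions rather than details from the paper.

```python
import math


def _diff(a, b):
    """Element-wise difference a - b of two flat lists."""
    return [x - y for x, y in zip(a, b)]


def _sq_norm(v):
    """Squared Euclidean norm of a flat list."""
    return sum(x * x for x in v)


def graft_layerwise(params, params_m, params_d, eps=1e-16):
    """Compute one scale factor per parameter group."""
    out = {}
    for name, w in params.items():
        step_m = _diff(params_m[name], w)
        step_d = _diff(params_d[name], w)
        scale = math.sqrt(_sq_norm(step_m)) / (math.sqrt(_sq_norm(step_d)) + eps)
        out[name] = [wi + scale * di for wi, di in zip(w, step_d)]
    return out


def graft_global(params, params_m, params_d, eps=1e-16):
    """Compute a single scale factor from norms taken over all groups."""
    steps_m = {k: _diff(params_m[k], params[k]) for k in params}
    steps_d = {k: _diff(params_d[k], params[k]) for k in params}
    norm_m = math.sqrt(sum(_sq_norm(s) for s in steps_m.values()))
    norm_d = math.sqrt(sum(_sq_norm(s) for s in steps_d.values()))
    scale = norm_m / (norm_d + eps)
    return {
        k: [wi + scale * di for wi, di in zip(params[k], steps_d[k])]
        for k in params
    }
```

In the layer-wise version each group's step has the length of M's step for that group, so the relative step sizes across layers follow M. In the global version only the overall step length follows M, and the split across layers follows D.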

The first experimental question addressed by AdaGraft is the following: To what degree does an optimizer's implicit step size schedule determine its training curve? To this end, given a set of base optimizers, we can perform training runs for all pairs (M, D), where grafting (M, M) is understood as simply running M. For the main experiments, we use SGD, Adam, and AdaGrad, all with momentum β = 0.9.
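The experimental grid can be enumerated directly: every ordered pair (M, D) drawn from the base optimizers is one training run, and the diagonal pairs correspond to the ungrafted base optimizers. The helper name `grafting_grid` below is hypothetical and only illustrates the bookkeeping.

```python
from itertools import product


def grafting_grid(optimizer_names):
    """All ordered (M, D) pairs: M supplies magnitude, D supplies direction."""
    return list(product(optimizer_names, repeat=2))


base = ["SGD", "Adam", "AdaGrad"]
grid = grafting_grid(base)

# Diagonal entries (M, M) amount to running the base optimizer M unchanged.
diagonal = [(m, d) for m, d in grid if m == d]
```

With the three base optimizers used in the main experiments, this gives nine runs, three of which are the base optimizers themselves.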

All experiments were carried out on 32 cores of a TPU-v3 Pod [JYP], using the Lingvo [SNW] sequence-to-sequence framework built on top of TensorFlow [ABC+16].

3.2 ImageNet classification experiments

We ran all pairs of grafted optimizers on a 50-layer residual network [HZRS16] with 26M parameters, trained on ImageNet classification [DDS]. We used a batch size of 4096, enabled by the large-scale training infrastructure, and a learning rate schedule consisting of a linear warmup and stepwise exponential decay. All details can be found in Appendix C.2.
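A schedule of this general shape (linear warmup followed by stepwise exponential decay) can be sketched as below. Every numeric default (`base_lr`, `warmup_steps`, `decay_every`, `decay_factor`) is a placeholder chosen for illustration; the actual values are in Appendix C.2, which is not part of this excerpt.

```python
def learning_rate(step, base_lr=1.0, warmup_steps=1000,
                  decay_every=10000, decay_factor=0.5):
    """Linear warmup to base_lr, then multiply by decay_factor every decay_every steps.

    All default values are illustrative placeholders, not the paper's settings.
    """
    if step < warmup_steps:
        # Ramp linearly so that the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    num_decays = (step - warmup_steps) // decay_every
    return base_lr * decay_factor ** num_decays
```

The decay is "stepwise" because the rate stays constant between decay boundaries and drops by a fixed factor at each one, rather than shrinking continuously.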

Table 2 shows top-1 and top-5 accuracies at convergence. The final accuracies at convergence, as well as training loss curves, are very stable (no more than 0.1% deviation) across runs, due to the large batch size. Figure 1 shows at a glance our main empirical observation: that the shapes of the training curves are clustered by the choice of M, the optimizer which supplies the step magnitude.

We stress that no additional hyperparameter tuning was done in these experiments; not even the global scalar learning rate needed adjustment. Thus, starting with N tuned optimizer setups, grafting produces a table of N² setups with no additional effort. Each row of this table controls for the implicit step size schedule.

M \ D      SGD            Adam           AdaGrad
SGD        75.4 · 92.6    72.8 · 91.2    73.7 · 91.4
Adam       74.1 · 91.9    73.0 · 91.3    73.7 · 91.6
AdaGrad    65.0 · 85.9    65.1 · 86.0    65.3 · 86.3

Table 2: Top-1 and top-5 accuracies (shown as top-1 · top-5; rows are M, columns are D) at training step t = 50K for ImageNet experiments. Averaged over 3 trials; no accuracy varied by more than 0.1%.
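As a rough numerical check on Table 2, one can compare how much top-1 accuracy varies within a row (fixed M, varying D) against how much it varies within a column (fixed D, varying M). The sketch below uses only the top-1 numbers from the table and assumes that rows correspond to M and columns to D. Final accuracy is not the same thing as the shape of a training curve, so this is a loose illustration of the clustering by M, not a reproduction of Figure 1.

```python
# Top-1 accuracies from Table 2, indexed as top1[M][D].
top1 = {
    "SGD":     {"SGD": 75.4, "Adam": 72.8, "AdaGrad": 73.7},
    "Adam":    {"SGD": 74.1, "Adam": 73.0, "AdaGrad": 73.7},
    "AdaGrad": {"SGD": 65.0, "Adam": 65.1, "AdaGrad": 65.3},
}


def spread(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)


# Variation when M is fixed and D changes (within a row).
row_spread = {m: spread(list(top1[m].values())) for m in top1}

# Variation when D is fixed and M changes (within a column).
col_spread = {d: spread([top1[m][d] for m in top1]) for d in top1}

for name in top1:
    print(f"{name:8s} fixed M: {row_spread[name]:.1f}  fixed D: {col_spread[name]:.1f}")
```

Every within-row spread is smaller than every within-column spread, so in these numbers changing M moves the final accuracy more than changing D does.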
