otherwise. The first case only happens in the most severe case of sparsity: when a gradient has been zero at all timesteps except at the current timestep. For less sparse cases, the effective stepsize will be smaller. When $(1 - \beta_1) = \sqrt{1 - \beta_2}$ we have that $|\hat{m}_t/\sqrt{\hat{v}_t}| < 1$, therefore $|\Delta_t| < \alpha$. In more common scenarios, we will have that $\hat{m}_t/\sqrt{\hat{v}_t} \approx \pm 1$ since $|E[g]/\sqrt{E[g^2]}| \leq 1$. The effective magnitude of the steps taken in parameter space at each timestep is approximately bounded by the stepsize setting $\alpha$, i.e., $|\Delta_t| \lessapprox \alpha$. This can be understood as establishing a trust region around the current parameter value, beyond which the current gradient estimate does not provide sufficient information. This typically makes it relatively easy to know the right scale of $\alpha$ in advance. For many machine learning models, for instance, we often know in advance that good optima are with high probability within some set region in parameter space; it is not uncommon, for example, to have a prior distribution over the parameters. Since $\alpha$ sets (an upper bound of) the magnitude of steps in parameter space, we can often deduce the right order of magnitude of $\alpha$ such that optima can be reached from $\theta_0$ within some number of iterations. With a slight abuse of terminology, we will call the ratio $\hat{m}_t/\sqrt{\hat{v}_t}$ the signal-to-noise ratio (SNR). With a smaller SNR the effective stepsize $\Delta_t$ will be closer to zero. This is a desirable property, since a smaller SNR means that there is greater uncertainty about whether the direction of $\hat{m}_t$ corresponds to the direction of the true gradient. For example, the SNR value typically becomes closer to 0 towards an optimum, leading to smaller effective steps in parameter space: a form of automatic annealing. The effective stepsize $\Delta_t$ is also invariant to the scale of the gradients; rescaling the gradients $g$ with factor $c$ will scale $\hat{m}_t$ with a factor $c$ and $\hat{v}_t$ with a factor $c^2$, which cancel out: $(c \cdot \hat{m}_t)/(\sqrt{c^2 \cdot \hat{v}_t}) = \hat{m}_t/\sqrt{\hat{v}_t}$.
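As a numerical illustration of these two properties (this sketch is not part of the paper; the synthetic gradient distribution, the rescaling factor $c$, and the hyperparameter values are arbitrary choices, and $\epsilon$ is omitted to isolate the ratio), the following Python snippet checks that the update $\Delta_t = \alpha \cdot \hat{m}_t/\sqrt{\hat{v}_t}$ is unchanged when all gradients are rescaled by $c$, and prints the largest realized $|\Delta_t|/\alpha$, which stays close to 1:

```python
import numpy as np

def adam_updates(grads, alpha=0.001, beta1=0.9, beta2=0.999):
    """Bias-corrected updates Delta_t = alpha * m_hat_t / sqrt(v_hat_t) (epsilon omitted)."""
    m = v = 0.0
    out = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # first moment estimate
        v = beta2 * v + (1 - beta2) * g ** 2     # second raw moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias-corrected estimates
        v_hat = v / (1 - beta2 ** t)
        out.append(alpha * m_hat / np.sqrt(v_hat))
    return np.array(out)

rng = np.random.default_rng(0)
g = rng.normal(0.5, 1.0, size=1000)              # synthetic noisy gradients of one parameter
c = 37.0                                         # arbitrary rescaling factor

# Scale invariance: (c * m_hat_t) / sqrt(c^2 * v_hat_t) = m_hat_t / sqrt(v_hat_t)
print(np.allclose(adam_updates(g), adam_updates(c * g)))   # True
# Effective steps are approximately bounded by alpha
print(np.max(np.abs(adam_updates(g))) / 0.001)             # close to 1
```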
3 INITIALIZATION BIAS CORRECTION
As explained in section 2, Adam utilizes initialization bias correction terms. We will here derive the term for the second moment estimate; the derivation for the first moment estimate is completely analogous. Let $g$ be the gradient of the stochastic objective $f$, and we wish to estimate its second raw moment (uncentered variance) using an exponential moving average of the squared gradient, with decay rate $\beta_2$. Let $g_1, \dots, g_T$ be the gradients at subsequent timesteps, each a draw from an underlying gradient distribution $g_t \sim p(g_t)$. Let us initialize the exponential moving average as $v_0 = 0$ (a vector of zeros). First note that the update at timestep $t$ of the exponential moving average $v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (where $g_t^2$ indicates the elementwise square $g_t \odot g_t$) can be written as a function of the gradients at all previous timesteps:

$$v_t = (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} \cdot g_i^2 \qquad (1)$$
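As a quick sanity check of eq. (1), the sketch below (not part of the paper; the decay rate and the synthetic gradient stream are arbitrary) compares the recursion $v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ against the closed-form weighted sum:

```python
import numpy as np

beta2 = 0.999
rng = np.random.default_rng(1)
g = rng.normal(size=200)             # synthetic gradients g_1, ..., g_T

# Recursive exponential moving average, initialized at v_0 = 0
v_rec = 0.0
for g_t in g:
    v_rec = beta2 * v_rec + (1 - beta2) * g_t ** 2

# Closed form of eq. (1): v_t = (1 - beta2) * sum_i beta2^(t-i) * g_i^2
t = len(g)
i = np.arange(1, t + 1)
v_closed = (1 - beta2) * np.sum(beta2 ** (t - i) * g ** 2)

print(np.isclose(v_rec, v_closed))   # True
```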
We wish to know how $E[v_t]$, the expected value of the exponential moving average at timestep $t$, relates to the true second moment $E[g_t^2]$, so we can correct for the discrepancy between the two. Taking expectations of the left-hand and right-hand sides of eq. (1):

$$
\begin{aligned}
E[v_t] &= E\Big[(1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} \cdot g_i^2\Big] && (2)\\
&= E[g_t^2] \cdot (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} + \zeta && (3)\\
&= E[g_t^2] \cdot (1 - \beta_2^t) + \zeta && (4)
\end{aligned}
$$
where $\zeta = 0$ if the true second moment $E[g_i^2]$ is stationary; otherwise $\zeta$ can be kept small, since the exponential decay rate $\beta_2$ can (and should) be chosen such that the exponential moving average assigns small weights to gradients too far in the past. What is left is the term $(1 - \beta_2^t)$, which is caused by initializing the running average with zeros. In algorithm 1 we therefore divide by this term to correct the initialization bias.
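The stationary case ($\zeta = 0$) of eq. (4) can be checked by simulation. The sketch below is not from the paper; the Gaussian gradient distribution, $\beta_2$, $t$, and the number of runs are arbitrary choices. It averages $v_t$ over many independent gradient streams, compares the result with $E[g^2] \cdot (1 - \beta_2^t)$, and shows that dividing by $(1 - \beta_2^t)$ recovers $E[g^2]$:

```python
import numpy as np

beta2, t, runs = 0.999, 10, 200_000
mu, sigma = 0.5, 1.0                       # stationary gradient distribution N(mu, sigma^2)
true_second_moment = mu ** 2 + sigma ** 2  # E[g^2]

rng = np.random.default_rng(2)
g = rng.normal(mu, sigma, size=(runs, t))  # independent gradient streams

# Run the EMA v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2 for each stream
v = np.zeros(runs)
for step in range(t):
    v = beta2 * v + (1 - beta2) * g[:, step] ** 2

print(v.mean())                            # ~ E[g^2] * (1 - beta2^t): strongly biased towards zero
print(true_second_moment * (1 - beta2 ** t))  # the biased value predicted by eq. (4)
print((v / (1 - beta2 ** t)).mean())       # ~ E[g^2] after dividing by the correction term
```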
In case of sparse gradients, a reliable estimate of the second moment requires averaging over many gradients, i.e. choosing a small value of $1 - \beta_2$ ($\beta_2$ close to 1); however, it is exactly in this case that a lack of initialization bias correction would lead to initial steps that are much larger.
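To make the last point concrete, the following sketch (not from the paper; the hyperparameters and the gradient value are illustrative) compares the magnitude of the very first update with and without bias correction when $\beta_2$ is close to 1: without correction the first step has magnitude $\alpha (1 - \beta_1)/\sqrt{1 - \beta_2} \approx 3.2\,\alpha$ for these values, and the gap grows as $\beta_2 \to 1$.

```python
import numpy as np

alpha, beta1, beta2 = 0.001, 0.9, 0.999   # illustrative hyperparameter values
g1 = 0.01                                 # the first observed gradient (value is arbitrary)

m1 = (1 - beta1) * g1                     # first-moment EMA after one step (m_0 = 0)
v1 = (1 - beta2) * g1 ** 2                # second-moment EMA after one step (v_0 = 0)

step_uncorrected = alpha * m1 / np.sqrt(v1)
step_corrected = alpha * (m1 / (1 - beta1)) / np.sqrt(v1 / (1 - beta2))

print(abs(step_uncorrected) / alpha)      # ~3.16: (1 - beta1) / sqrt(1 - beta2)
print(abs(step_corrected) / alpha)        # 1.0:  m_hat_1 / sqrt(v_hat_1) = sign(g_1)
```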