order to obtain good performance. The general principle is: the amount of regularization must be
balanced for each dataset and architecture. Recognition of this principle permits general use of
super-convergence. Reducing other forms of regularization and regularizing with very large learning
rates makes training significantly more efficient.
4 Estimating optimal learning rates
Gradient or steepest descent is an optimization method that uses the slope, as computed by the derivative, to move in the direction of greatest negative gradient to iteratively update a variable. That is, given an initial point $x_0$, gradient descent proposes the next point to be:
$$x = x_0 - \epsilon \nabla_x f(x) \qquad (1)$$
where $\epsilon$ is the step size or learning rate. If we denote the parameters in a neural network (i.e., weights) as $\theta \in \mathbb{R}^N$ and $f(\theta)$ is the loss function, we can apply gradient descent to learn the weights of a network; i.e., with input $x$, a solution $y$, and non-linearity $\sigma$:
$$y = f(\theta) = \sigma(W_l\,\sigma(W_{l-1}\,\sigma(W_{l-2}\cdots\sigma(W_0\,x + b_0)\cdots)) + b_l) \qquad (2)$$
where $W_l \in \theta$ are the weights for layer $l$ and $b_l \in \theta$ are the biases for layer $l$.
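As a concrete illustration of Equations 1 and 2, the sketch below builds a two-layer instance of the network and applies one gradient descent update to its weights. NumPy, the layer sizes, the squared-error loss, and the finite-difference gradients are assumptions made for this example, not details from the text.

```python
# A minimal sketch (not the paper's code): a two-layer instance of Eq. (2)
# with a sigmoid non-linearity, updated once by the gradient descent rule
# of Eq. (1). The layer sizes, squared-error loss, and finite-difference
# gradients are illustrative assumptions.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))            # non-linearity sigma

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # input x
y_target = rng.normal(size=2)                   # solution y

# theta = {W_0, b_0, W_1, b_1}
W0, b0 = rng.normal(size=(8, 4)), np.zeros(8)
W1, b1 = rng.normal(size=(2, 8)), np.zeros(2)
params = [W0, b0, W1, b1]

def forward(x):
    # y = sigma(W_1 sigma(W_0 x + b_0) + b_1): Eq. (2) with l = 1
    return sigma(W1 @ sigma(W0 @ x + b0) + b1)

def loss():
    return 0.5 * np.sum((forward(x) - y_target) ** 2)   # f(theta)

eps = 0.1                                       # learning rate epsilon
h = 1e-6                                        # finite-difference step
f0 = loss()

# grad_theta f(theta) estimated by forward differences, one entry at a time
grads = []
for p in params:
    g = np.zeros_like(p)
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + h
        g[idx] = (loss() - f0) / h
        p[idx] = old
    grads.append(g)

# theta <- theta - eps * grad_theta f(theta): the update of Eq. (1)
for p, g in zip(params, grads):
    p -= eps * g

print(f"loss before: {f0:.4f}  after: {loss():.4f}")
```

In practice the gradients would come from backpropagation; the finite-difference loop is used here only to keep the sketch self-contained.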
The Hessian-free optimization method [Martens, 2010] suggests a second-order solution that utilizes the slope information contained in the second derivative (i.e., the derivative of the gradient, $\nabla_\theta f(\theta)$).
From Martens [2010], the main idea of the second-order Newton's method is that the loss function can be locally approximated by the quadratic:
$$f(\theta) \approx f(\theta_0) + (\theta - \theta_0)^T \nabla_\theta f(\theta_0) + \frac{1}{2}(\theta - \theta_0)^T H\,(\theta - \theta_0) \qquad (3)$$
where $H$ is the Hessian, or the second derivative matrix of $f(\theta_0)$. Writing Equation 1 to update the parameters at iteration $i$ as:
$$\theta_{i+1} = \theta_i - \epsilon \nabla_\theta f(\theta_i) \qquad (4)$$
allows Equation 3 to be re-written as:
$$f(\theta_i - \epsilon \nabla_\theta f(\theta_i)) \approx f(\theta_i) + (\theta_{i+1} - \theta_i)^T \nabla_\theta f(\theta_i) + \frac{1}{2}(\theta_{i+1} - \theta_i)^T H\,(\theta_{i+1} - \theta_i) \qquad (5)$$
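For intuition, consider a one-dimensional sketch of Equations 3–5 (this example and the symbol $h$ are illustrative, not from Martens [2010]): take $f(\theta) = \frac{1}{2} h \theta^2$ with curvature $h > 0$, so that $\nabla_\theta f(\theta) = h\theta$ and the quadratic approximation is exact. One gradient descent step then gives
$$f(\theta_i - \epsilon\, h\theta_i) = \tfrac{1}{2}\, h\,\theta_i^2\,(1 - \epsilon h)^2,$$
which is minimized at $\epsilon = 1/h$; that is, the best step size is the reciprocal of the curvature, and the equations that follow estimate this curvature without ever forming $H$.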
In general it is not feasible to compute the Hessian matrix, which has $\Omega(N^2)$ elements, where $N$ is the number of parameters in the network, but it is unnecessary to compute the full Hessian. The Hessian expresses the curvature in all directions in a high-dimensional space, but the only relevant curvature direction is the direction of steepest descent that SGD will traverse. This concept is contained within Hessian-free optimization, as Martens [2010] suggests a finite difference approach for obtaining an estimate of the Hessian from two gradients:
$$H(\theta) = \lim_{\delta \to 0} \frac{\nabla f(\theta + \delta) - \nabla f(\theta)}{\delta} \qquad (6)$$
where $\delta$ should be in the direction of the steepest descent.
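As a sketch of how Equation 6 can be used, the code below estimates the curvature along the steepest-descent direction from just two gradient evaluations; the toy 2-D quadratic loss and its Hessian `A` are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of Eq. (6): estimate the curvature along the
# steepest-descent direction from two gradients via a finite difference.
# The 2-D quadratic loss and its Hessian A are assumptions chosen so the
# estimate can be checked against the exact directional curvature d^T A d.
import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])                  # Hessian of the toy loss
grad = lambda theta: A @ theta              # grad f(theta) = A theta

theta = np.array([1.0, -2.0])
g = grad(theta)
d = -g / np.linalg.norm(g)                  # unit steepest-descent direction

delta = 1e-5
# Eq. (6) applied along d: H d ~ (grad f(theta + delta d) - grad f(theta)) / delta
Hd_estimate = (grad(theta + delta * d) - grad(theta)) / delta

print(d @ Hd_estimate)                      # ~ d^T A d, curvature along d
print(d @ A @ d)                            # exact value for comparison
```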
The AdaSecant method [Gulcehre et al., 2014, 2017] builds an adaptive learning rate method based on this finite difference approximation as:
$$\epsilon^* \approx \frac{\theta_{i+1} - \theta_i}{\nabla f(\theta_{i+1}) - \nabla f(\theta_i)} \qquad (7)$$
where $\epsilon^*$ represents the optimal learning rate for each of the neurons. Utilizing Equation 4, we rewrite Equation 7 in terms of the differences between the weights from three sequential iterations as:
$$\epsilon^* = \epsilon\,\frac{\theta_{i+1} - \theta_i}{2\theta_{i+1} - \theta_i - \theta_{i+2}} \qquad (8)$$
where $\epsilon$ on the right-hand side is the learning rate value actually used in the calculations to update the weights. Equation 8 is an expression for an adaptive learning rate for each weight update.
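To make Equation 8 concrete, the sketch below (an illustration under assumed settings, not the paper's code) runs two plain gradient descent updates on a one-dimensional quadratic loss with known curvature $h$, then recovers the optimal rate $1/h$ from the three iterates.

```python
# Illustrative sketch of Eq. (8): estimate the optimal learning rate from
# three sequential iterates theta_i, theta_{i+1}, theta_{i+2} produced by
# gradient descent. The toy loss f(theta) = 0.5 * h * theta**2 (curvature h)
# is an assumption chosen so the expected answer, 1/h, is known exactly.
import numpy as np

h = 4.0                                  # curvature of the toy loss
grad = lambda theta: h * theta           # grad f(theta)

eps = 0.05                               # learning rate actually used (eps in Eq. 8)
theta_i = np.array([2.0])

# Two updates of Eq. (4): theta_{k+1} = theta_k - eps * grad f(theta_k)
theta_ip1 = theta_i - eps * grad(theta_i)
theta_ip2 = theta_ip1 - eps * grad(theta_ip1)

# Eq. (8): eps* = eps * (theta_{i+1} - theta_i) / (2*theta_{i+1} - theta_i - theta_{i+2})
eps_star = eps * (theta_ip1 - theta_i) / (2 * theta_ip1 - theta_i - theta_ip2)

print(eps_star)                          # [0.25] = 1/h, the curvature-matched step
```

With a vector of weights the same expression applies elementwise, yielding one adaptive rate per weight, which motivates the global aggregation discussed next.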
We borrow the method in Schaul et al. [2013] to obtain an estimate of the global learning rate from the weight-specific rates by summing over the numerator and denominator, with one minor difference: in Schaul et al. [2013] their expression is squared, leading to positive values; therefore we sum the