order to obtain good performance. The general principle is: the amount of regularization must be
balanced for each dataset and architecture. Recognition of this principle permits general use of
super-convergence. Reducing other forms of regularization and regularizing with very large learning
rates makes training significantly more efficient.
4 Estimating optimal learning rates
Gradient or steepest descent is an optimization method that uses the slope, as computed by the derivative, to move in the direction of greatest negative gradient to iteratively update a variable. That is, given an initial point $x_0$, gradient descent proposes the next point to be:
$$x = x_0 - \epsilon \nabla_x f(x) \qquad (1)$$
where $\epsilon$ is the step size or learning rate. If we denote the parameters in a neural network (i.e., weights) as $\theta \in \mathbb{R}^N$ and $f(\theta)$ is the loss function, we can apply gradient descent to learn the weights of a network; i.e., with input $x$, a solution $y$, and non-linearity $\sigma$:
$$y = f(\theta) = \sigma(W_l\,\sigma(W_{l-1}\,\sigma(W_{l-2}\cdots\sigma(W_0\,x + b_0)\cdots)) + b_l) \qquad (2)$$
where $W_l \in \theta$ are the weights for layer $l$ and $b_l \in \theta$ are the biases for layer $l$.
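As a concrete illustration of Equations 1 and 2, the sketch below builds a two-layer instance of the network and applies one gradient descent update to its weights. NumPy, the layer sizes, the squared-error loss, and the finite-difference gradients are assumptions made for this example, not details from the text.

```python
# A minimal sketch (not the paper's code): a two-layer instance of Eq. (2)
# with a sigmoid non-linearity, updated once by the gradient descent rule
# of Eq. (1). The layer sizes, squared-error loss, and finite-difference
# gradients are illustrative assumptions.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))            # non-linearity sigma

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # input x
y_target = rng.normal(size=2)                   # solution y

# theta = {W_0, b_0, W_1, b_1}
W0, b0 = rng.normal(size=(8, 4)), np.zeros(8)
W1, b1 = rng.normal(size=(2, 8)), np.zeros(2)
params = [W0, b0, W1, b1]

def forward(x):
    # y = sigma(W_1 sigma(W_0 x + b_0) + b_1): Eq. (2) with l = 1
    return sigma(W1 @ sigma(W0 @ x + b0) + b1)

def loss():
    return 0.5 * np.sum((forward(x) - y_target) ** 2)   # f(theta)

eps = 0.1                                       # learning rate epsilon
h = 1e-6                                        # finite-difference step
f0 = loss()

# grad_theta f(theta) estimated by forward differences, one entry at a time
grads = []
for p in params:
    g = np.zeros_like(p)
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + h
        g[idx] = (loss() - f0) / h
        p[idx] = old
    grads.append(g)

# theta <- theta - eps * grad_theta f(theta): the update of Eq. (1)
for p, g in zip(params, grads):
    p -= eps * g

print(f"loss before: {f0:.4f}  after: {loss():.4f}")
```

In practice the gradients would come from backpropagation; the finite-difference loop is used here only to keep the sketch self-contained.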
The Hessian-free optimization method [Martens, 2010] suggests a second-order solution that utilizes the slope information contained in the second derivative (i.e., the derivative of the gradient, $\nabla_\theta f(\theta)$).
From Martens [2010], the main idea of the second-order Newton's method is that the loss function can be locally approximated by the quadratic:
$$f(\theta) \approx f(\theta_0) + (\theta - \theta_0)^T \nabla_\theta f(\theta_0) + \frac{1}{2}(\theta - \theta_0)^T H\,(\theta - \theta_0) \qquad (3)$$
where $H$ is the Hessian, or the second derivative matrix of $f(\theta_0)$. Writing Equation 1 to update the parameters at iteration $i$ as:
$$\theta_{i+1} = \theta_i - \epsilon \nabla_\theta f(\theta_i) \qquad (4)$$
allows Equation 3 to be re-written as:
$$f(\theta_i - \epsilon \nabla_\theta f(\theta_i)) \approx f(\theta_i) + (\theta_{i+1} - \theta_i)^T \nabla_\theta f(\theta_i) + \frac{1}{2}(\theta_{i+1} - \theta_i)^T H\,(\theta_{i+1} - \theta_i) \qquad (5)$$
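For intuition, consider a one-dimensional sketch of Equations 3–5 (this example and the symbol $h$ are illustrative, not from Martens [2010]): take $f(\theta) = \frac{1}{2} h \theta^2$ with curvature $h > 0$, so that $\nabla_\theta f(\theta) = h\theta$ and the quadratic approximation is exact. One gradient descent step then gives
$$f(\theta_i - \epsilon\, h\theta_i) = \tfrac{1}{2}\, h\,\theta_i^2\,(1 - \epsilon h)^2,$$
which is minimized at $\epsilon = 1/h$; that is, the best step size is the reciprocal of the curvature, and the equations that follow estimate this curvature without ever forming $H$.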
In general it is not feasible to compute the Hessian matrix, which has $\Omega(N^2)$ elements, where $N$ is the number of parameters in the network, but it is unnecessary to compute the full Hessian. The Hessian expresses the curvature in all directions in a high-dimensional space, but the only relevant curvature direction is the direction of steepest descent that SGD will traverse. This concept is contained within Hessian-free optimization, as Martens [2010] suggests a finite difference approach for obtaining an estimate of the Hessian from two gradients:
$$H(\theta) = \lim_{\delta \to 0} \frac{\nabla f(\theta + \delta) - \nabla f(\theta)}{\delta} \qquad (6)$$
where $\delta$ should be in the direction of the steepest descent.
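As a sketch of how Equation 6 can be used, the code below estimates the curvature along the steepest-descent direction from just two gradient evaluations; the toy 2-D quadratic loss and its Hessian `A` are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of Eq. (6): estimate the curvature along the
# steepest-descent direction from two gradients via a finite difference.
# The 2-D quadratic loss and its Hessian A are assumptions chosen so the
# estimate can be checked against the exact directional curvature d^T A d.
import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])                  # Hessian of the toy loss
grad = lambda theta: A @ theta              # grad f(theta) = A theta

theta = np.array([1.0, -2.0])
g = grad(theta)
d = -g / np.linalg.norm(g)                  # unit steepest-descent direction

delta = 1e-5
# Eq. (6) applied along d: H d ~ (grad f(theta + delta d) - grad f(theta)) / delta
Hd_estimate = (grad(theta + delta * d) - grad(theta)) / delta

print(d @ Hd_estimate)                      # ~ d^T A d, curvature along d
print(d @ A @ d)                            # exact value for comparison
```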
The AdaSecant method [Gulcehre et al., 2014, 2017] builds an adaptive learning rate method based on this finite difference approximation as:
$$\epsilon^* \approx \frac{\theta_{i+1} - \theta_i}{\nabla f(\theta_{i+1}) - \nabla f(\theta_i)} \qquad (7)$$
where $\epsilon^*$ represents the optimal learning rate for each of the neurons. Utilizing Equation 4, we rewrite Equation 7 in terms of the differences between the weights from three sequential iterations as:
$$\epsilon^* = \epsilon\,\frac{\theta_{i+1} - \theta_i}{2\theta_{i+1} - \theta_i - \theta_{i+2}} \qquad (8)$$
where $\epsilon$ on the right-hand side is the learning rate value actually used in the calculations to update the weights. Equation 8 is an expression for an adaptive learning rate for each weight update.
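To make Equation 8 concrete, the sketch below (an illustration under assumed settings, not the paper's code) runs two plain gradient descent updates on a one-dimensional quadratic loss with known curvature $h$, then recovers the optimal rate $1/h$ from the three iterates.

```python
# Illustrative sketch of Eq. (8): estimate the optimal learning rate from
# three sequential iterates theta_i, theta_{i+1}, theta_{i+2} produced by
# gradient descent. The toy loss f(theta) = 0.5 * h * theta**2 (curvature h)
# is an assumption chosen so the expected answer, 1/h, is known exactly.
import numpy as np

h = 4.0                                  # curvature of the toy loss
grad = lambda theta: h * theta           # grad f(theta)

eps = 0.05                               # learning rate actually used (eps in Eq. 8)
theta_i = np.array([2.0])

# Two updates of Eq. (4): theta_{k+1} = theta_k - eps * grad f(theta_k)
theta_ip1 = theta_i - eps * grad(theta_i)
theta_ip2 = theta_ip1 - eps * grad(theta_ip1)

# Eq. (8): eps* = eps * (theta_{i+1} - theta_i) / (2*theta_{i+1} - theta_i - theta_{i+2})
eps_star = eps * (theta_ip1 - theta_i) / (2 * theta_ip1 - theta_i - theta_ip2)

print(eps_star)                          # [0.25] = 1/h, the curvature-matched step
```

With a vector of weights the same expression applies elementwise, yielding one adaptive rate per weight, which motivates the global aggregation discussed next.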
We borrow the method in Schaul et al. [2013] to obtain an estimate of the global learning rate from the weight-specific rates by summing over the numerator and denominator, with one minor difference: in Schaul et al. [2013] their expression is squared, leading to positive values; therefore we sum the