3 STOCHASTIC GRADIENT DESCENT WITH WARM RESTARTS (SGDR)
The existing restart techniques can also be used for stochastic gradient descent if the stochasticity
is taken into account. Since gradients and loss values can vary widely from one batch of the data
to another, one should denoise the incoming information: by considering averaged gradients and
losses, e.g., once per epoch, the above-mentioned restart techniques can be used again.
In this work, we consider one of the simplest warm restart approaches. We simulate a new warm-started run / restart of SGD once $T_i$ epochs are performed, where $i$ is the index of the run. Importantly, the restarts are not performed from scratch but emulated by increasing the learning rate $\eta_t$ while the old value of $x_t$ is used as an initial solution. The amount of this increase controls the extent to which the previously acquired information (e.g., momentum) is used.
Within the i-th run, we decay the learning rate with a cosine annealing for each batch as follows:
$$\eta_t = \eta_{min}^i + \frac{1}{2}\left(\eta_{max}^i - \eta_{min}^i\right)\left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right), \qquad (5)$$
where $\eta_{min}^i$ and $\eta_{max}^i$ are ranges for the learning rate, and $T_{cur}$ accounts for how many epochs have been performed since the last restart. Since $T_{cur}$ is updated at each batch iteration $t$, it can take discretized values such as 0.1, 0.2, etc. Thus, $\eta_t = \eta_{max}^i$ when $t = 0$ and $T_{cur} = 0$. Once $T_{cur} = T_i$, the cos function will output $-1$ and thus $\eta_t = \eta_{min}^i$. The decrease of the learning rate is shown in Figure 1 for fixed $T_i = 50$, $T_i = 100$ and $T_i = 200$; note that the logarithmic axis obfuscates the typical shape of the cosine function.
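For concreteness, eq. (5) can be evaluated per batch as in the following minimal Python sketch (the function name and argument names are ours, for illustration only):

```python
import math

def cosine_annealing_lr(eta_min, eta_max, t_cur, t_i):
    """Eq. (5): cosine decay of the learning rate from eta_max (at t_cur = 0)
    to eta_min (at t_cur = t_i); t_cur may be fractional, updated per batch."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))
```

For example, `cosine_annealing_lr(0.0, 0.05, 0.0, 50.0)` returns the maximum rate 0.05, while `t_cur = 50.0` returns the minimum rate 0.0.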
In order to improve anytime performance, we suggest an option to start with an initially small $T_i$ and increase it by a factor of $T_{mult}$ at every restart (see, e.g., Figure 1 for $T_0 = 1$, $T_{mult} = 2$ and $T_0 = 10$, $T_{mult} = 2$). It might be of great interest to decrease $\eta_{max}^i$ and $\eta_{min}^i$ at every new restart. However, for the sake of simplicity, here, we keep $\eta_{max}^i$ and $\eta_{min}^i$ the same for every $i$ to reduce the number of hyperparameters involved.
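The resulting schedule across restarts can be sketched as follows, assuming a per-batch update of $T_{cur}$ and fixed $\eta_{max}^i$, $\eta_{min}^i$ (variable names are illustrative, not from the original implementation):

```python
import math

def sgdr_learning_rates(eta_min, eta_max, t_0, t_mult, n_epochs, batches_per_epoch):
    """Yield the learning rate of eq. (5) for every batch and perform a warm
    restart whenever T_cur reaches T_i, multiplying the run length by T_mult."""
    t_i, t_cur = float(t_0), 0.0
    for _ in range(n_epochs * batches_per_epoch):
        yield eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))
        t_cur += 1.0 / batches_per_epoch   # T_cur advances by one epoch's fraction per batch
        if t_cur >= t_i:                   # warm restart: reset T_cur,
            t_cur, t_i = 0.0, t_i * t_mult #   lengthen the next run by T_mult
```

With $T_0 = 1$ and $T_{mult} = 2$, this yields run lengths of 1, 2, 4, 8, ... epochs.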
Since our simulated warm restarts (the increase of the learning rate) often temporarily worsen performance, we do not always use the last $x_t$ as our recommendation for the best solution (also called the incumbent solution). While our recommendation during the first run (before the first restart) is indeed the last $x_t$, our recommendation after this is a solution obtained at the end of the last performed run at $\eta_t = \eta_{min}^i$. We emphasize that with the help of this strategy, our method does not require a separate validation data set to determine a recommendation.
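In code, this recommendation strategy amounts to snapshotting the parameters only at the ends of runs; a possible sketch, where `model` and `train_one_epoch` are hypothetical helpers rather than parts of the original implementation:

```python
import copy

def train_sgdr(model, train_one_epoch, t_0, t_mult, n_epochs):
    """Train for n_epochs and keep an incumbent only at the ends of runs."""
    incumbent, t_i, t_cur = copy.deepcopy(model), float(t_0), 0.0
    for _ in range(n_epochs):
        train_one_epoch(model)                 # learning rate follows eq. (5) inside
        t_cur += 1.0
        if t_cur >= t_i:                       # eta_t has just reached eta_min^i
            incumbent = copy.deepcopy(model)   # snapshot: the recommended solution
            t_cur, t_i = 0.0, t_i * t_mult     # then perform the warm restart
    return incumbent
```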
4 EXPERIMENTAL RESULTS
4.1 EXPERIMENTAL SETTINGS
We consider the problem of training Wide Residual Neural Networks (WRNs; see Zagoruyko &
Komodakis (2016) for details) on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009). We
will use the abbreviation WRN-d-k to denote a WRN with depth d and width k. Zagoruyko &
Komodakis (2016) obtained the best results with a WRN-28-10 architecture, i.e., a Residual Neural
Network with d = 28 layers and k = 10 times more filters per layer than used in the original
Residual Neural Networks (He et al., 2015; 2016).
The CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009) consist of 32×32 color images drawn
from 10 and 100 classes, respectively, split into 50,000 train and 10,000 test images. For image
preprocessing Zagoruyko & Komodakis (2016) performed global contrast normalization and ZCA
whitening. For data augmentation they performed horizontal flips and random crops from the image
padded by 4 pixels on each side, filling missing pixels with reflections of the original image.
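This augmentation corresponds, for example, to the following torchvision transforms (a modern re-implementation for illustration, not the authors' original pipeline):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4, padding_mode='reflect'),  # 4-pixel reflection padding
    transforms.RandomHorizontalFlip(),                             # random horizontal flips
    transforms.ToTensor(),
])
```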
For training, Zagoruyko & Komodakis (2016) used SGD with Nesterov’s momentum with the initial learning rate set to $\eta_0 = 0.1$, weight decay to 0.0005, dampening to 0, momentum to 0.9 and
minibatch size to 128. The learning rate is dropped by a factor of 0.2 at 60, 120 and 160 epochs,
with a total budget of 200 epochs. We reproduce the results of Zagoruyko & Komodakis (2016) with
the same settings except that i) we subtract per-pixel mean only and do not use ZCA whitening; ii)
we use SGD with momentum as described by eq. (3-4) and not Nesterov’s momentum.
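For reference, this baseline step-wise schedule corresponds, e.g., to the following PyTorch sketch (PyTorch is our choice here for illustration, not necessarily the framework used; `model` and `train_one_epoch` are hypothetical placeholders, and `nesterov=False` reflects our use of plain momentum per eq. (3-4)):

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            dampening=0, weight_decay=0.0005, nesterov=False)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[60, 120, 160], gamma=0.2)
for epoch in range(200):
    train_one_epoch(model, optimizer)  # one pass over the training set
    scheduler.step()                   # drop the learning rate by 0.2 at the milestones
```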