3 STOCHASTIC GRADIENT DESCENT WITH WARM RESTARTS (SGDR)
The existing restart techniques can also be used for stochastic gradient descent if the stochasticity
is taken into account. Since gradients and loss values can vary widely from one batch of the data
to another, one should denoise the incoming information: by considering averaged gradients and
losses, e.g., once per epoch, the above-mentioned restart techniques can be used again.
In this work, we consider one of the simplest warm restart approaches. We simulate a new warm-started run / restart of SGD once $T_i$ epochs are performed, where $i$ is the index of the run. Importantly, the restarts are not performed from scratch but emulated by increasing the learning rate $\eta_t$ while the old value of $x_t$ is used as an initial solution. The amount of this increase controls the extent to which the previously acquired information (e.g., momentum) is used.
Within the i-th run, we decay the learning rate with a cosine annealing for each batch as follows:
$$\eta_t = \eta_{min}^i + \frac{1}{2}\left(\eta_{max}^i - \eta_{min}^i\right)\left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right), \qquad (5)$$
where $\eta_{min}^i$ and $\eta_{max}^i$ are ranges for the learning rate, and $T_{cur}$ accounts for how many epochs have been performed since the last restart. Since $T_{cur}$ is updated at each batch iteration $t$, it can take discretized values such as 0.1, 0.2, etc. Thus, $\eta_t = \eta_{max}^i$ when $t = 0$ and $T_{cur} = 0$. Once $T_{cur} = T_i$, the cos function will output $-1$ and thus $\eta_t = \eta_{min}^i$. The decrease of the learning rate is shown in Figure 1 for fixed $T_i = 50$, $T_i = 100$ and $T_i = 200$; note that the logarithmic axis obfuscates the typical shape of the cosine function.
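For concreteness, eq. (5) can be evaluated per batch as in the following minimal Python sketch (the function name and argument names are ours, for illustration only):

```python
import math

def cosine_annealing_lr(eta_min, eta_max, t_cur, t_i):
    """Eq. (5): cosine decay of the learning rate from eta_max (at t_cur = 0)
    to eta_min (at t_cur = t_i); t_cur may be fractional, updated per batch."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))
```

For example, `cosine_annealing_lr(0.0, 0.05, 0.0, 50.0)` returns the maximum rate 0.05, while `t_cur = 50.0` returns the minimum rate 0.0.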
In order to improve anytime performance, we suggest an option to start with an initially small $T_i$ and increase it by a factor of $T_{mult}$ at every restart (see, e.g., Figure 1 for $T_0 = 1$, $T_{mult} = 2$ and $T_0 = 10$, $T_{mult} = 2$). It might be of great interest to decrease $\eta_{max}^i$ and $\eta_{min}^i$ at every new restart. However, for the sake of simplicity, here, we keep $\eta_{max}^i$ and $\eta_{min}^i$ the same for every $i$ to reduce the number of hyperparameters involved.
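The resulting schedule across restarts can be sketched as follows, assuming a per-batch update of $T_{cur}$ and fixed $\eta_{max}^i$, $\eta_{min}^i$ (variable names are illustrative, not from the original implementation):

```python
import math

def sgdr_learning_rates(eta_min, eta_max, t_0, t_mult, n_epochs, batches_per_epoch):
    """Yield the learning rate of eq. (5) for every batch and perform a warm
    restart whenever T_cur reaches T_i, multiplying the run length by T_mult."""
    t_i, t_cur = float(t_0), 0.0
    for _ in range(n_epochs * batches_per_epoch):
        yield eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))
        t_cur += 1.0 / batches_per_epoch   # T_cur advances by one epoch's fraction per batch
        if t_cur >= t_i:                   # warm restart: reset T_cur,
            t_cur, t_i = 0.0, t_i * t_mult #   lengthen the next run by T_mult
```

With $T_0 = 1$ and $T_{mult} = 2$, this yields run lengths of 1, 2, 4, 8, ... epochs.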
Since our simulated warm restarts (the increase of the learning rate) often temporarily worsen performance, we do not always use the last $x_t$ as our recommendation for the best solution (also called the incumbent solution). While our recommendation during the first run (before the first restart) is indeed the last $x_t$, our recommendation after this is a solution obtained at the end of the last performed run at $\eta_t = \eta_{min}^i$. We emphasize that with the help of this strategy, our method does not require a separate validation data set to determine a recommendation.
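In code, this recommendation strategy amounts to snapshotting the parameters only at the ends of runs; a possible sketch, where `model` and `train_one_epoch` are hypothetical helpers rather than parts of the original implementation:

```python
import copy

def train_sgdr(model, train_one_epoch, t_0, t_mult, n_epochs):
    """Train for n_epochs and keep an incumbent only at the ends of runs."""
    incumbent, t_i, t_cur = copy.deepcopy(model), float(t_0), 0.0
    for _ in range(n_epochs):
        train_one_epoch(model)                 # learning rate follows eq. (5) inside
        t_cur += 1.0
        if t_cur >= t_i:                       # eta_t has just reached eta_min^i
            incumbent = copy.deepcopy(model)   # snapshot: the recommended solution
            t_cur, t_i = 0.0, t_i * t_mult     # then perform the warm restart
    return incumbent
```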
4 EXPERIMENTAL RESULTS
4.1 EXPERIMENTAL SETTINGS
We consider the problem of training Wide Residual Neural Networks (WRNs; see Zagoruyko &
Komodakis (2016) for details) on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009). We
will use the abbreviation WRN-d-k to denote a WRN with depth d and width k. Zagoruyko &
Komodakis (2016) obtained the best results with a WRN-28-10 architecture, i.e., a Residual Neural
Network with d = 28 layers and k = 10 times more filters per layer than used in the original
Residual Neural Networks (He et al., 2015; 2016).
The CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009) consist of 32×32 color images drawn
from 10 and 100 classes, respectively, split into 50,000 train and 10,000 test images. For image
preprocessing Zagoruyko & Komodakis (2016) performed global contrast normalization and ZCA
whitening. For data augmentation they performed horizontal flips and random crops from the image
padded by 4 pixels on each side, filling missing pixels with reflections of the original image.
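This augmentation corresponds, for example, to the following torchvision transforms (a modern re-implementation for illustration, not the authors' original pipeline):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4, padding_mode='reflect'),  # 4-pixel reflection padding
    transforms.RandomHorizontalFlip(),                             # random horizontal flips
    transforms.ToTensor(),
])
```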
For training, Zagoruyko & Komodakis (2016) used SGD with Nesterov’s momentum with the initial learning rate set to $\eta_0 = 0.1$, weight decay to 0.0005, dampening to 0, momentum to 0.9 and
minibatch size to 128. The learning rate is dropped by a factor of 0.2 at 60, 120 and 160 epochs,
with a total budget of 200 epochs. We reproduce the results of Zagoruyko & Komodakis (2016) with
the same settings except that i) we subtract per-pixel mean only and do not use ZCA whitening; ii)
we use SGD with momentum as described by eq. (3-4) and not Nesterov’s momentum.
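For reference, this baseline step-wise schedule corresponds, e.g., to the following PyTorch sketch (PyTorch is our choice here for illustration, not necessarily the framework used; `model` and `train_one_epoch` are hypothetical placeholders, and `nesterov=False` reflects our use of plain momentum per eq. (3-4)):

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            dampening=0, weight_decay=0.0005, nesterov=False)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[60, 120, 160], gamma=0.2)
for epoch in range(200):
    train_one_epoch(model, optimizer)  # one pass over the training set
    scheduler.step()                   # drop the learning rate by 0.2 at the milestones
```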