Shake-Shake Regularization: A New Strategy for Mitigating Overfitting in Deep Learning

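Shake-Shake regularization is a regularization method for deep learning that mitigates overfitting by replacing the standard summation of branches in a multi-branch network with a stochastic affine combination. Applied to 3-branch residual networks, it reaches test errors of 2.86% on CIFAR-10 and 15.85% on CIFAR-100. Experiments on architectures without skip connections or Batch Normalization also show encouraging results, which suggests the method may carry over to a wider range of applications. The code is publicly available on GitHub.

Deep learning learns complex data representations by stacking many neural network layers. As networks grow deeper, overfitting tends to become more severe: the model fits the training data well but performs worse on unseen test data, because it has adapted too closely to the training set to generalize.

Shake-Shake regularization, proposed by Xavier Gastaldi, targets this problem. In a conventional multi-branch network, the outputs of the parallel branches are simply added together. Shake-Shake instead combines them with random coefficients that sum to one (a stochastic affine combination). The noise this injects during training discourages the model from relying too heavily on any single branch or feature, which lowers the risk of overfitting.

The method is applied to 3-branch residual networks. Residual networks introduce skip connections that let information pass directly across layers, which eases the vanishing-gradient problem in very deep models. On the CIFAR-10 and CIFAR-100 image classification benchmarks, the reported test errors support the method's effectiveness. CIFAR-10 contains 10 classes with 6,000 32x32 color images each, while CIFAR-100 contains 100 classes with 600 images each; both datasets are widely used to evaluate deep learning models.

Shake-Shake also shows encouraging results on architectures without skip connections or Batch Normalization. Batch Normalization is commonly used to speed up training and improve stability and accuracy, so these results indicate that the regularizer can still help when that component is absent. This offers an option for settings where Batch Normalization is unsuitable or unavailable.

In short, Shake-Shake regularization improves generalization by introducing randomness into how branches are combined, and it can improve the performance of deep networks on small datasets. The open-source code lets researchers and practitioners explore and apply the method further.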


Shake-Shake regularization

Xavier Gastaldi

xgastaldi.mba2011@london.edu

Abstract

The method introduced in this paper aims at helping deep learning practitioners faced with an overfit problem. The idea is to replace, in a multi-branch network, the standard summation of parallel branches with a stochastic affine combination. Applied to 3-branch residual networks, shake-shake regularization improves on the best single shot published results on CIFAR-10 and CIFAR-100 by reaching test errors of 2.86% and 15.85%. Experiments on architectures without skip connections or Batch Normalization show encouraging results and open the door to a large set of applications. Code is available at https://github.com/xgastaldi/shake-shake.

1 Introduction

Deep residual nets (He et al., 2016a) were first introduced in the ILSVRC & COCO 2015 competitions (Russakovsky et al., 2015; Lin et al., 2014), where they won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. Since then, significant effort has been put into trying to improve their performance. Scientists have investigated the impact of pushing depth (He et al., 2016b; Huang et al., 2016a), width (Zagoruyko & Komodakis, 2016) and cardinality (Xie et al., 2016; Szegedy et al., 2016; Abdi & Nahavandi, 2016).

While residual networks are powerful models, they still overfit on small datasets. A large number of techniques have been proposed to tackle this problem, including weight decay (Nowlan & Hinton, 1992), early stopping, and dropout (Srivastava et al., 2014). While not directly presented as a regularization method, Batch Normalization (Ioffe & Szegedy, 2015) regularizes the network by computing statistics that fluctuate with each mini-batch. Similarly, Stochastic Gradient Descent (SGD) (Bottou, 1998; Sutskever et al., 2013) can also be interpreted as Gradient Descent using noisy gradients and the generalization performance of neural networks often depends on the size of the mini-batch (see Keskar et al. (2017)).
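The reading of SGD as gradient descent with noisy gradients can be illustrated with a small sketch. It is not taken from the paper: the scalar model y = w * x, the squared-error loss, and the helper names grad and minibatch_grads are all assumptions chosen for illustration.

```python
import random


def grad(w, xs, ys):
    """Gradient w.r.t. w of the mean of 0.5 * (w*x - y)**2 over the given samples."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def minibatch_grads(w, xs, ys, batch_size, n_batches, rng):
    """Return n_batches gradient estimates, each from a random mini-batch.

    Each estimate is computed on batch_size samples drawn without
    replacement, so it is a noisy version of the full-data gradient.
    """
    indices = range(len(xs))
    estimates = []
    for _ in range(n_batches):
        batch = rng.sample(indices, batch_size)
        estimates.append(grad(w, [xs[i] for i in batch], [ys[i] for i in batch]))
    return estimates


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
    rng = random.Random(0)
    print("full gradient:", grad(0.0, xs, ys))
    print("batch size 1 :", minibatch_grads(0.0, xs, ys, 1, 5, rng))
    print("batch size 4 :", minibatch_grads(0.0, xs, ys, 4, 5, rng))
```

With a batch size equal to the dataset size, every estimate coincides with the full gradient; smaller batches scatter around it, which is the noise the paragraph above refers to.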

Pre-2015, most computer vision classification architectures used dropout to combat overfit but the introduction of Batch Normalization reduced its effectiveness (see Ioffe & Szegedy (2015); Zagoruyko & Komodakis (2016); Huang et al. (2016b)). Searching for other regularization methods, researchers started to look at the possibilities specifically offered by multi-branch networks. Some of them noticed that, given the right conditions, it was possible to randomly drop some of the information paths during training (Huang et al., 2016b; Larsson et al., 2016).

Like these last 2 works, the method proposed in this document aims at improving the generalization ability of multi-branch networks by replacing the standard summation of parallel branches with a stochastic affine combination.
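A minimal sketch of what such a stochastic affine combination could look like for one block with a skip path and two residual branches is given below. It assumes details that this excerpt does not state: the coefficient alpha is drawn uniformly from [0, 1) on each training pass and is fixed to 0.5 at inference. The function name shake_shake_combine is hypothetical, the branches are plain lists of floats rather than tensors, and only the forward combination is shown.

```python
import random


def shake_shake_combine(x, branch1, branch2, training, rng=random):
    """Combine a skip path x with two residual branch outputs.

    A standard residual sum would be x + branch1 + branch2. Here the two
    branches are mixed with weights alpha and (1 - alpha), which sum to one.
    Assumption: alpha ~ U[0, 1) while training, alpha = 0.5 at inference.
    Returns the combined output and the alpha that was used.
    """
    alpha = rng.random() if training else 0.5
    out = [xi + alpha * b1 + (1.0 - alpha) * b2
           for xi, b1, b2 in zip(x, branch1, branch2)]
    return out, alpha


if __name__ == "__main__":
    x, b1, b2 = [1.0, 2.0], [0.0, 4.0], [2.0, 0.0]
    print(shake_shake_combine(x, b1, b2, training=False))
    print(shake_shake_combine(x, b1, b2, training=True, rng=random.Random(0)))
```

Because the weights sum to one, each training output lies between x + branch1 and x + branch2, so the block sees a different blend of its branches on every pass while the inference output is their fixed average.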

1.1 Motivation

Data augmentation techniques have traditionally been applied to input images only. However, for a computer, there is no real difference between an input image and an intermediate representation. As a consequence, it might be possible to apply data augmentation techniques to internal representations.

arXiv:1705.07485v2 [cs.LG] 23 May 2017
