way networks have not demonstrated accuracy gains with
extremely increased depth (e.g., over 100 layers).
3. Deep Residual Learning
3.1. Residual Learning
Let us consider H(x) as an underlying mapping to be
fit by a few stacked layers (not necessarily the entire net),
with x denoting the inputs to the first of these layers. If one
hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions², then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., H(x) − x (assuming that
the input and output are of the same dimensions). So
rather than expect stacked layers to approximate H(x), we
explicitly let these layers approximate a residual function
F(x) := H(x) − x. The original function thus becomes
F(x)+x. Although both forms should be able to asymptot-
ically approximate the desired functions (as hypothesized),
the ease of learning might be different.
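As a small illustration of this reformulation (a hedged sketch, not anything from the paper: the function h below is an arbitrary stand-in for an underlying mapping H that happens to be close to the identity), the residual branch only has to account for the difference between H(x) and x:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)

def h(x):
    # Hypothetical underlying mapping H(x): a small perturbation of the identity.
    return x + 0.1 * np.tanh(x)

residual_target = h(x) - x           # F(x) := H(x) - x, what the stacked layers are asked to fit
reconstructed = residual_target + x  # F(x) + x recovers the original mapping H(x)

print(np.allclose(reconstructed, h(x)))  # True
print(np.abs(residual_target).max())     # the residual target is small when H is near the identity
```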
This reformulation is motivated by the counterintuitive
phenomena about the degradation problem (Fig. 1, left). As
we discussed in the introduction, if the added layers can
be constructed as identity mappings, a deeper model should
have training error no greater than its shallower counter-
part. The degradation problem suggests that the solvers
might have difficulties in approximating identity mappings
by multiple nonlinear layers. With the residual learning re-
formulation, if identity mappings are optimal, the solvers
may simply drive the weights of the multiple nonlinear lay-
ers toward zero to approach identity mappings.
In real cases, it is unlikely that identity mappings are op-
timal, but our reformulation may help to precondition the
problem. If the optimal function is closer to an identity
mapping than to a zero mapping, it should be easier for the
solver to find the perturbations with reference to an identity
mapping, than to learn the function as a new one. We show
by experiments (Fig. 7) that the learned residual functions in
general have small responses, suggesting that identity map-
pings provide reasonable preconditioning.
3.2. Identity Mapping by Shortcuts
We adopt residual learning to every few stacked layers.
A building block is shown in Fig. 2. Formally, in this paper
we consider a building block defined as:
y = F(x, {W_i}) + x. (1)
Here x and y are the input and output vectors of the layers considered. The function F(x, {W_i}) represents the
residual mapping to be learned. For the example in Fig. 2
that has two layers, F = W_2 σ(W_1 x) in which σ denotes ReLU [29] and the biases are omitted for simplifying notations. The operation F + x is performed by a shortcut connection and element-wise addition. We adopt the second nonlinearity after the addition (i.e., σ(y), see Fig. 2).
²This hypothesis, however, is still an open question. See [28].
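For concreteness, the following is a minimal PyTorch sketch of this two-layer building block (an illustration under assumed layer sizes, not the authors' implementation; biases are dropped to mirror the notation above):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Eqn.(1): y = F(x, {W_i}) + x, with F = W_2 * relu(W_1 * x)."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim, bias=False)  # W_1
        self.fc2 = nn.Linear(dim, dim, bias=False)  # W_2
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.fc2(self.relu(self.fc1(x)))  # residual function F(x, {W_i})
        y = f + x                             # shortcut connection + element-wise addition
        return self.relu(y)                   # second nonlinearity after the addition

x = torch.randn(4, 64)
print(BasicResidualBlock(64)(x).shape)  # torch.Size([4, 64]); the shortcut adds no parameters
```

Note that if W_1 and W_2 are driven toward zero, F vanishes and f + x reduces to x, which is the identity-preconditioning argument of Sec. 3.1.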
The shortcut connections in Eqn.(1) introduce neither extra parameters nor extra computational complexity. This is not only
attractive in practice but also important in our comparisons
between plain and residual networks. We can fairly com-
pare plain/residual networks that simultaneously have the
same number of parameters, depth, width, and computa-
tional cost (except for the negligible element-wise addition).
The dimensions of x and F must be equal in Eqn.(1).
If this is not the case (e.g., when changing the input/output
channels), we can perform a linear projection W_s by the shortcut connections to match the dimensions:
y = F(x, {W_i}) + W_s x. (2)
We can also use a square matrix W_s in Eqn.(1). But we will show by experiments that the identity mapping is sufficient for addressing the degradation problem and is economical, and thus W_s is only used when matching dimensions.
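A hedged sketch of Eqn.(2), with W_s realized as a learned linear projection and the dimensions chosen arbitrarily for illustration:

```python
import torch
import torch.nn as nn

class ProjectionResidualBlock(nn.Module):
    """Eqn.(2): y = F(x, {W_i}) + W_s x, used only when input/output dimensions differ."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, out_dim, bias=False)   # W_1
        self.fc2 = nn.Linear(out_dim, out_dim, bias=False)  # W_2
        self.proj = nn.Linear(in_dim, out_dim, bias=False)  # W_s: matches the dimensions
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.fc2(self.relu(self.fc1(x)))
        return self.relu(f + self.proj(x))  # shortcut is a projection instead of the identity

x = torch.randn(4, 64)
print(ProjectionResidualBlock(64, 128)(x).shape)  # torch.Size([4, 128])
```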
The form of the residual function F is flexible. Exper-
iments in this paper involve a function F that has two or
three layers (Fig. 5), while more layers are possible. But if
F has only a single layer, Eqn.(1) is similar to a linear layer:
y = W_1 x + x, for which we have not observed advantages.
We also note that although the above notations are about
fully-connected layers for simplicity, they are applicable to
convolutional layers. The function F(x, {W_i}) can repre-
sent multiple convolutional layers. The element-wise addi-
tion is performed on two feature maps, channel by channel.
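A minimal convolutional counterpart of the block above, again only a sketch with assumed channel and spatial sizes (batch normalization and other implementation details are omitted):

```python
import torch
import torch.nn as nn

class ConvResidualBlock(nn.Module):
    """Convolutional form of Eqn.(1): F is two 3x3 conv layers; the addition is per channel."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))  # F(x, {W_i}) as stacked convolutions
        return self.relu(f + x)  # element-wise addition of two feature maps, channel by channel

x = torch.randn(1, 64, 56, 56)  # (batch, channels, height, width); sizes are illustrative
print(ConvResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```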
3.3. Network Architectures
We have tested various plain/residual nets, and have ob-
served consistent phenomena. To provide instances for dis-
cussion, we describe two models for ImageNet as follows.
Plain Network. Our plain baselines (Fig. 3, middle) are
mainly inspired by the philosophy of VGG nets [41] (Fig. 3,
left). The convolutional layers mostly have 3×3 filters and
follow two simple design rules: (i) for the same output
feature map size, the layers have the same number of fil-
ters; and (ii) if the feature map size is halved, the num-
ber of filters is doubled so as to preserve the time com-
plexity per layer. We perform downsampling directly by
convolutional layers that have a stride of 2. The network
ends with a global average pooling layer and a 1000-way
fully-connected layer with softmax. The total number of
weighted layers is 34 in Fig. 3 (middle).
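The two design rules and the stride-2 downsampling can be summarized in a short sketch; this is a simplified stand-in, not the exact 34-layer configuration (stage depths are placeholders, and the 7×7 input stem is omitted, so the input tensor below stands in for the stem's output):

```python
import torch
import torch.nn as nn

def conv3x3(in_ch, out_ch, stride=1):
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False)

def plain_stage(in_ch, out_ch, num_layers, downsample):
    """Rule (i): same number of filters for the same feature-map size.
    Rule (ii): when the map is halved (stride-2 conv), the caller doubles the filters."""
    layers = [conv3x3(in_ch, out_ch, stride=2 if downsample else 1), nn.ReLU()]
    for _ in range(num_layers - 1):
        layers += [conv3x3(out_ch, out_ch), nn.ReLU()]
    return nn.Sequential(*layers)

class PlainNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.stages = nn.Sequential(
            plain_stage(64, 64, 3, downsample=False),   # placeholder depths, not the 34-layer spec
            plain_stage(64, 128, 3, downsample=True),   # map halved -> filters doubled
            plain_stage(128, 256, 3, downsample=True),
            plain_stage(256, 512, 3, downsample=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.fc = nn.Linear(512, num_classes) # 1000-way fully-connected layer (softmax in the loss)

    def forward(self, x):
        x = self.stages(x)
        x = self.pool(x).flatten(1)
        return self.fc(x)

print(PlainNet()(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 1000])
```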
It is worth noting that our model has fewer filters and
lower complexity than VGG nets [41] (Fig. 3, left). Our 34-
layer baseline has 3.6 billion FLOPs (multiply-adds), which
is only 18% of VGG-19 (19.6 billion FLOPs).
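For reference, FLOPs counted as multiply-adds for a single convolutional layer follow the standard formula sketched below; summing it over all weighted layers of a given configuration (the per-layer sizes are listed in the paper's tables) yields totals such as the 3.6 billion quoted above. The helper name is ours, not the paper's.

```python
def conv_mult_adds(h_out, w_out, c_in, c_out, k):
    """Multiply-adds of one conv layer: each output value needs k*k*c_in multiply-adds."""
    return h_out * w_out * c_out * (k * k * c_in)

# Example: a 3x3 conv with 64 input and 64 output channels on a 56x56 feature map.
print(conv_mult_adds(56, 56, 64, 64, 3) / 1e9)  # ~0.116 billion multiply-adds
```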