way networks have not demonstrated accuracy gains with
extremely increased depth (e.g., over 100 layers).
3. Deep Residual Learning
3.1. Residual Learning
Let us consider H(x) as an underlying mapping to be
fit by a few stacked layers (not necessarily the entire net),
with x denoting the inputs to the first of these layers. If one
hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions², then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., H(x) − x (assuming that
the input and output are of the same dimensions). So
rather than expect stacked layers to approximate H(x), we
explicitly let these layers approximate a residual function
F(x) := H(x) − x. The original function thus becomes
F(x)+x. Although both forms should be able to asymptot-
ically approximate the desired functions (as hypothesized),
the ease of learning might be different.
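As a small illustration of this reformulation (a hedged sketch, not anything from the paper: the function h below is an arbitrary stand-in for an underlying mapping H that happens to be close to the identity), the residual branch only has to account for the difference between H(x) and x:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)

def h(x):
    # Hypothetical underlying mapping H(x): a small perturbation of the identity.
    return x + 0.1 * np.tanh(x)

residual_target = h(x) - x           # F(x) := H(x) - x, what the stacked layers are asked to fit
reconstructed = residual_target + x  # F(x) + x recovers the original mapping H(x)

print(np.allclose(reconstructed, h(x)))  # True
print(np.abs(residual_target).max())     # the residual target is small when H is near the identity
```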
This reformulation is motivated by the counterintuitive
phenomena about the degradation problem (Fig. 1, left). As
we discussed in the introduction, if the added layers can
be constructed as identity mappings, a deeper model should
have training error no greater than its shallower counter-
part. The degradation problem suggests that the solvers
might have difficulties in approximating identity mappings
by multiple nonlinear layers. With the residual learning re-
formulation, if identity mappings are optimal, the solvers
may simply drive the weights of the multiple nonlinear lay-
ers toward zero to approach identity mappings.
In real cases, it is unlikely that identity mappings are op-
timal, but our reformulation may help to precondition the
problem. If the optimal function is closer to an identity
mapping than to a zero mapping, it should be easier for the
solver to find the perturbations with reference to an identity
mapping, than to learn the function as a new one. We show
by experiments (Fig. 7) that the learned residual functions in
general have small responses, suggesting that identity map-
pings provide reasonable preconditioning.
3.2. Identity Mapping by Shortcuts
We adopt residual learning to every few stacked layers.
A building block is shown in Fig. 2. Formally, in this paper
we consider a building block defined as:
y = F(x, {W_i}) + x. (1)
Here x and y are the input and output vectors of the layers considered. The function F(x, {W_i}) represents the
residual mapping to be learned. For the example in Fig. 2
that has two layers, F = W_2 σ(W_1 x) in which σ denotes ReLU [29] and the biases are omitted for simplifying notations. The operation F + x is performed by a shortcut connection and element-wise addition. We adopt the second nonlinearity after the addition (i.e., σ(y), see Fig. 2).
²This hypothesis, however, is still an open question. See [28].
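For concreteness, the following is a minimal PyTorch sketch of this two-layer building block (an illustration under assumed layer sizes, not the authors' implementation; biases are dropped to mirror the notation above):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Eqn.(1): y = F(x, {W_i}) + x, with F = W_2 * relu(W_1 * x)."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim, bias=False)  # W_1
        self.fc2 = nn.Linear(dim, dim, bias=False)  # W_2
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.fc2(self.relu(self.fc1(x)))  # residual function F(x, {W_i})
        y = f + x                             # shortcut connection + element-wise addition
        return self.relu(y)                   # second nonlinearity after the addition

x = torch.randn(4, 64)
print(BasicResidualBlock(64)(x).shape)  # torch.Size([4, 64]); the shortcut adds no parameters
```

Note that if W_1 and W_2 are driven toward zero, F vanishes and f + x reduces to x, which is the identity-preconditioning argument of Sec. 3.1.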
The shortcut connections in Eqn.(1) introduce neither extra parameters nor extra computational complexity. This is not only
attractive in practice but also important in our comparisons
between plain and residual networks. We can fairly com-
pare plain/residual networks that simultaneously have the
same number of parameters, depth, width, and computa-
tional cost (except for the negligible element-wise addition).
The dimensions of x and F must be equal in Eqn.(1).
If this is not the case (e.g., when changing the input/output
channels), we can perform a linear projection W_s by the shortcut connections to match the dimensions:
y = F(x, {W_i}) + W_s x. (2)
We can also use a square matrix W_s in Eqn.(1). But we will show by experiments that the identity mapping is sufficient for addressing the degradation problem and is economical, and thus W_s is only used when matching dimensions.
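A hedged sketch of Eqn.(2), with W_s realized as a learned linear projection and the dimensions chosen arbitrarily for illustration:

```python
import torch
import torch.nn as nn

class ProjectionResidualBlock(nn.Module):
    """Eqn.(2): y = F(x, {W_i}) + W_s x, used only when input/output dimensions differ."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, out_dim, bias=False)   # W_1
        self.fc2 = nn.Linear(out_dim, out_dim, bias=False)  # W_2
        self.proj = nn.Linear(in_dim, out_dim, bias=False)  # W_s: matches the dimensions
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.fc2(self.relu(self.fc1(x)))
        return self.relu(f + self.proj(x))  # shortcut is a projection instead of the identity

x = torch.randn(4, 64)
print(ProjectionResidualBlock(64, 128)(x).shape)  # torch.Size([4, 128])
```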
The form of the residual function F is flexible. Exper-
iments in this paper involve a function F that has two or
three layers (Fig. 5), while more layers are possible. But if
F has only a single layer, Eqn.(1) is similar to a linear layer:
y = W_1 x + x, for which we have not observed advantages.
We also note that although the above notations are about
fully-connected layers for simplicity, they are applicable to
convolutional layers. The function F(x, {W_i}) can repre-
sent multiple convolutional layers. The element-wise addi-
tion is performed on two feature maps, channel by channel.
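A minimal convolutional counterpart of the block above, again only a sketch with assumed channel and spatial sizes (batch normalization and other implementation details are omitted):

```python
import torch
import torch.nn as nn

class ConvResidualBlock(nn.Module):
    """Convolutional form of Eqn.(1): F is two 3x3 conv layers; the addition is per channel."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))  # F(x, {W_i}) as stacked convolutions
        return self.relu(f + x)  # element-wise addition of two feature maps, channel by channel

x = torch.randn(1, 64, 56, 56)  # (batch, channels, height, width); sizes are illustrative
print(ConvResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```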
3.3. Network Architectures
We have tested various plain/residual nets, and have ob-
served consistent phenomena. To provide instances for dis-
cussion, we describe two models for ImageNet as follows.
Plain Network. Our plain baselines (Fig. 3, middle) are
mainly inspired by the philosophy of VGG nets [41] (Fig. 3,
left). The convolutional layers mostly have 3×3 filters and
follow two simple design rules: (i) for the same output
feature map size, the layers have the same number of fil-
ters; and (ii) if the feature map size is halved, the num-
ber of filters is doubled so as to preserve the time com-
plexity per layer. We perform downsampling directly by
convolutional layers that have a stride of 2. The network
ends with a global average pooling layer and a 1000-way
fully-connected layer with softmax. The total number of
weighted layers is 34 in Fig. 3 (middle).
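The two design rules and the stride-2 downsampling can be summarized in a short sketch; this is a simplified stand-in, not the exact 34-layer configuration (stage depths are placeholders, and the 7×7 input stem is omitted, so the input tensor below stands in for the stem's output):

```python
import torch
import torch.nn as nn

def conv3x3(in_ch, out_ch, stride=1):
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False)

def plain_stage(in_ch, out_ch, num_layers, downsample):
    """Rule (i): same number of filters for the same feature-map size.
    Rule (ii): when the map is halved (stride-2 conv), the caller doubles the filters."""
    layers = [conv3x3(in_ch, out_ch, stride=2 if downsample else 1), nn.ReLU()]
    for _ in range(num_layers - 1):
        layers += [conv3x3(out_ch, out_ch), nn.ReLU()]
    return nn.Sequential(*layers)

class PlainNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.stages = nn.Sequential(
            plain_stage(64, 64, 3, downsample=False),   # placeholder depths, not the 34-layer spec
            plain_stage(64, 128, 3, downsample=True),   # map halved -> filters doubled
            plain_stage(128, 256, 3, downsample=True),
            plain_stage(256, 512, 3, downsample=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.fc = nn.Linear(512, num_classes) # 1000-way fully-connected layer (softmax in the loss)

    def forward(self, x):
        x = self.stages(x)
        x = self.pool(x).flatten(1)
        return self.fc(x)

print(PlainNet()(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 1000])
```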
It is worth noting that our model has fewer filters and
lower complexity than VGG nets [41] (Fig. 3, left). Our 34-
layer baseline has 3.6 billion FLOPs (multiply-adds), which
is only 18% of VGG-19 (19.6 billion FLOPs).
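For reference, FLOPs counted as multiply-adds for a single convolutional layer follow the standard formula sketched below; summing it over all weighted layers of a given configuration (the per-layer sizes are listed in the paper's tables) yields totals such as the 3.6 billion quoted above. The helper name is ours, not the paper's.

```python
def conv_mult_adds(h_out, w_out, c_in, c_out, k):
    """Multiply-adds of one conv layer: each output value needs k*k*c_in multiply-adds."""
    return h_out * w_out * c_out * (k * k * c_in)

# Example: a 3x3 conv with 64 input and 64 output channels on a 56x56 feature map.
print(conv_mult_adds(56, 56, 64, 64, 3) / 1e9)  # ~0.116 billion multiply-adds
```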