way networks have not demonstrated accuracy gains with
extremely increased depth (e.g., over 100 layers).
3. Deep Residual Learning
3.1. Residual Learning
Let us consider H(x) as an underlying mapping to be
fit by a few stacked layers (not necessarily the entire net),
with x denoting the inputs to the first of these layers. If one
hypothesizes that multiple nonlinear layers can asymptoti-
cally approximate complicated functions (this hypothesis,
however, is still an open question; see [28]), then it is equiv-
alent to hypothesize that they can asymptotically approxi-
mate the residual functions, i.e., H(x) − x (assuming that
the input and output are of the same dimensions). So
rather than expect stacked layers to approximate H(x), we
explicitly let these layers approximate a residual function
F(x) := H(x) − x. The original function thus becomes
F(x)+x. Although both forms should be able to asymptot-
ically approximate the desired functions (as hypothesized),
the ease of learning might be different.
This reformulation is motivated by the counterintuitive
phenomenon of the degradation problem (Fig. 1, left). As
we discussed in the introduction, if the added layers can
be constructed as identity mappings, a deeper model should
have training error no greater than its shallower counter-
part. The degradation problem suggests that the solvers
might have difficulties in approximating identity mappings
by multiple nonlinear layers. With the residual learning re-
formulation, if identity mappings are optimal, the solvers
may simply drive the weights of the multiple nonlinear lay-
ers toward zero to approach identity mappings.
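As a small numerical illustration of this point (ours, not the paper's): if the weights of a two-layer nonlinear branch are driven to zero, the residual block reduces exactly to the identity mapping. The vector size and the use of ReLU here are illustrative choices.

```python
import numpy as np

x = np.random.default_rng(0).standard_normal(8)

# Weights of the two nonlinear layers driven to zero ...
W1 = np.zeros((8, 8))
W2 = np.zeros((8, 8))
f = W2 @ np.maximum(W1 @ x, 0.0)   # ... make the residual branch output F(x) = 0,
y = f + x                          # so the block output F(x) + x is exactly x.
assert np.allclose(y, x)
```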
In real cases, it is unlikely that identity mappings are op-
timal, but our reformulation may help to precondition the
problem. If the optimal function is closer to an identity
mapping than to a zero mapping, it should be easier for the
solver to find the perturbations with reference to an identity
mapping, than to learn the function as a new one. We show
by experiments (Fig. 7) that the learned residual functions in
general have small responses, suggesting that identity map-
pings provide reasonable preconditioning.
3.2. Identity Mapping by Shortcuts
We apply residual learning to every few stacked layers.
A building block is shown in Fig. 2. Formally, in this paper
we consider a building block defined as:
y = F(x, {W_i}) + x. (1)
Here x and y are the input and output vectors of the lay-
ers considered. The function F(x, {W_i}) represents the
residual mapping to be learned. For the example in Fig. 2
that has two layers, F = W_2 σ(W_1 x) in which σ denotes
ReLU [29] and the biases are omitted to simplify notation.
The operation F + x is performed by a shortcut
connection and element-wise addition. We adopt the sec-
ond nonlinearity after the addition (i.e., σ(y), see Fig. 2).
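To make the block concrete, the following is a minimal sketch of the two-layer fully-connected form of Eqn.(1), with the second nonlinearity applied after the addition. It uses PyTorch rather than the paper's implementation; the module name and the bias-free linear layers are our simplifications, mirroring the omitted-bias notation above.

```python
import torch
import torch.nn as nn

class ResidualBlockFC(nn.Module):
    """y = relu(F(x, {W1, W2}) + x) with F(x) = W2 * relu(W1 * x)."""
    def __init__(self, dim):
        super().__init__()
        # Biases omitted to mirror the simplified notation in the text.
        self.w1 = nn.Linear(dim, dim, bias=False)
        self.w2 = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        f = self.w2(torch.relu(self.w1(x)))  # residual function F(x, {W_i})
        y = f + x                            # shortcut connection and element-wise addition
        return torch.relu(y)                 # second nonlinearity after the addition
```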
The shortcut connections in Eqn.(1) introduce neither extra
parameters nor computational complexity. This is not only
attractive in practice but also important in our comparisons
between plain and residual networks. We can fairly com-
pare plain/residual networks that simultaneously have the
same number of parameters, depth, width, and computa-
tional cost (except for the negligible element-wise addition).
The dimensions of x and F must be equal in Eqn.(1).
If this is not the case (e.g., when changing the input/output
channels), we can perform a linear projection W_s by the
shortcut connections to match the dimensions:
y = F(x, {W_i}) + W_s x. (2)
We can also use a square matrix W_s in Eqn.(1). But we will
show by experiments that the identity mapping is sufficient
for addressing the degradation problem and is economical,
and thus W_s is only used when matching dimensions.
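A sketch of the projection variant of Eqn.(2), under the same simplifications as above; `in_dim` and `out_dim` are illustrative names, and W_s is realized as a bias-free linear layer used only because the dimensions differ.

```python
import torch
import torch.nn as nn

class ProjectionBlockFC(nn.Module):
    """y = relu(F(x, {W1, W2}) + Ws*x), used when input/output dimensions differ."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w1 = nn.Linear(in_dim, out_dim, bias=False)
        self.w2 = nn.Linear(out_dim, out_dim, bias=False)
        self.ws = nn.Linear(in_dim, out_dim, bias=False)  # linear projection W_s on the shortcut

    def forward(self, x):
        f = self.w2(torch.relu(self.w1(x)))
        return torch.relu(f + self.ws(x))    # shortcut projected to match dimensions
```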
The form of the residual function F is flexible. Exper-
iments in this paper involve a function F that has two or
three layers (Fig. 5), while more layers are possible. But if
F has only a single layer, Eqn.(1) is similar to a linear layer:
y = W_1 x + x, for which we have not observed advantages.
We also note that although the above notations are about
fully-connected layers for simplicity, they are applicable to
convolutional layers. The function F(x, {W_i}) can repre-
sent multiple convolutional layers. The element-wise addi-
tion is performed on two feature maps, channel by channel.
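In the convolutional case, the same block might be sketched as follows; the 3×3 kernels and the size-preserving padding are illustrative choices consistent with the architectures described below, while the module name and the omission of normalization layers are our own simplifications.

```python
import torch
import torch.nn as nn

class ResidualBlockConv(nn.Module):
    """Two 3x3 convolutions with an identity shortcut; the addition is
    element-wise on the two feature maps, channel by channel."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        f = self.conv2(torch.relu(self.conv1(x)))  # F(x, {W_i}) over feature maps
        return torch.relu(f + x)                   # shortcut, addition, then ReLU
```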
3.3. Network Architectures
We have tested various plain/residual nets, and have ob-
served consistent phenomena. To provide instances for dis-
cussion, we describe two models for ImageNet as follows.
Plain Network. Our plain baselines (Fig. 3, middle) are
mainly inspired by the philosophy of VGG nets [41] (Fig. 3,
left). The convolutional layers mostly have 3×3 filters and
follow two simple design rules: (i) for the same output
feature map size, the layers have the same number of fil-
ters; and (ii) if the feature map size is halved, the num-
ber of filters is doubled so as to preserve the time com-
plexity per layer. We perform downsampling directly by
convolutional layers that have a stride of 2. The network
ends with a global average pooling layer and a 1000-way
fully-connected layer with softmax. The total number of
weighted layers is 34 in Fig. 3 (middle).
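The two design rules and the network tail can be illustrated with the sketch below; the per-stage layer counts are placeholders rather than the exact 34-layer configuration of Fig. 3, and the stem (the layers before the first stage) is omitted.

```python
import torch.nn as nn

def conv3x3(in_ch, out_ch, stride=1):
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False)

def plain_stage(in_ch, out_ch, num_layers, downsample):
    """Rule (i): layers in a stage share one filter count.
    Rule (ii): halving the feature map (stride-2 conv) doubles the filters."""
    layers = [conv3x3(in_ch, out_ch, stride=2 if downsample else 1), nn.ReLU(inplace=True)]
    for _ in range(num_layers - 1):
        layers += [conv3x3(out_ch, out_ch), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

# Structural illustration only; stage depths here are placeholders, not Fig. 3's counts.
body = nn.Sequential(
    plain_stage(64, 64, 3, downsample=False),
    plain_stage(64, 128, 4, downsample=True),   # halve the map, double the filters
    plain_stage(128, 256, 6, downsample=True),
    plain_stage(256, 512, 3, downsample=True),
)
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # global average pooling
    nn.Flatten(),
    nn.Linear(512, 1000),      # 1000-way fully-connected layer (softmax applied in the loss)
)
```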
It is worth noting that our model has fewer filters and
lower complexity than VGG nets [41] (Fig. 3, left). Our 34-
layer baseline has 3.6 billion FLOPs (multiply-adds), which
is only 18% of VGG-19 (19.6 billion FLOPs).
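For reference, FLOP figures of this kind can be reproduced layer by layer with the standard multiply-add count for a convolution (output height × width × input channels × output channels × kernel area); the helper below is ours, added only to illustrate the accounting.

```python
def conv_multiply_adds(out_h, out_w, in_ch, out_ch, kernel=3):
    """Multiply-adds of one convolutional layer (biases and activations ignored)."""
    return out_h * out_w * in_ch * out_ch * kernel * kernel

# Example: a 3x3 convolution with 256 input and 256 output channels on a
# 14x14 feature map costs about 0.12 billion multiply-adds.
print(conv_multiply_adds(14, 14, 256, 256))  # 115605504
```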