Units”. The original Residual Unit in [1] performs the following computation:
y_l = h(x_l) + F(x_l, W_l),    (1)
x_{l+1} = f(y_l).    (2)
Here x_l is the input feature to the l-th Residual Unit. W_l = {W_{l,k} | 1 ≤ k ≤ K} is a set of weights (and biases) associated with the l-th Residual Unit, and K is the number of layers in a Residual Unit (K is 2 or 3 in [1]). F denotes the residual function, e.g., a stack of two 3×3 convolutional layers in [1]. The function f is the operation after element-wise addition, and in [1] f is ReLU. The function h is set as an identity mapping: h(x_l) = x_l.¹
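As a concrete illustration of Eqns.(1)-(2), the sketch below implements one such Residual Unit in PyTorch, taking F as a stack of two 3×3 convolutions, h as identity, and f as ReLU after the addition; the class name OriginalResidualUnit and the channel configuration are our own illustrative choices, not prescribed by [1].

```python
import torch
import torch.nn as nn


class OriginalResidualUnit(nn.Module):
    """One original Residual Unit: y_l = h(x_l) + F(x_l, W_l), x_{l+1} = f(y_l).

    Minimal sketch assuming equal input/output channels, so h can remain identity.
    """

    def __init__(self, channels):
        super().__init__()
        # F: two 3x3 convolutions (with BN and ReLU in between), as described in the text.
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        y = x + self.residual(x)   # Eqn.(1): y_l = h(x_l) + F(x_l, W_l), with h = identity
        return torch.relu(y)       # Eqn.(2): x_{l+1} = f(y_l), with f = ReLU as in [1]


# Example usage on a random feature map.
x_next = OriginalResidualUnit(64)(torch.randn(2, 64, 32, 32))
```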
If f is also an identity mapping: x_{l+1} ≡ y_l, we can put Eqn.(2) into Eqn.(1) and obtain:

x_{l+1} = x_l + F(x_l, W_l).    (3)
Recursively (x_{l+2} = x_{l+1} + F(x_{l+1}, W_{l+1}) = x_l + F(x_l, W_l) + F(x_{l+1}, W_{l+1}), etc.) we will have:

x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i),    (4)
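The unrolled form of Eqn.(4) is easy to verify numerically: iterating Eqn.(3) while separately accumulating the residual-branch outputs recovers exactly the same x_L. A small sketch under toy assumptions (random tanh residual functions and feature vectors of size 8; none of these choices come from the paper):

```python
import torch

torch.manual_seed(0)
d, l, L = 8, 0, 5                        # feature size, shallow unit l, deep unit L

# Toy residual functions F(., W_i): one random weight matrix per unit (illustrative only).
weights = [0.1 * torch.randn(d, d) for _ in range(L)]

x_shallow = torch.randn(d)               # x_l
x_i, residual_sum = x_shallow.clone(), torch.zeros(d)
for i in range(l, L):
    f_i = torch.tanh(x_i @ weights[i])   # F(x_i, W_i)
    residual_sum += f_i                  # accumulate the sum in Eqn.(4)
    x_i = x_i + f_i                      # Eqn.(3): x_{i+1} = x_i + F(x_i, W_i)

# Eqn.(4): x_L equals x_l plus the sum of all intermediate residual outputs.
assert torch.allclose(x_i, x_shallow + residual_sum, atol=1e-6)
```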
for any deeper unit L and any shallower unit l. Eqn.(4) exhibits some nice properties. (i) The feature x_L of any deeper unit L can be represented as the feature x_l of any shallower unit l plus a residual function in the form of \sum_{i=l}^{L-1} F, indicating that the model is in a residual form between any units L and l. (ii) The feature x_L = x_0 + \sum_{i=0}^{L-1} F(x_i, W_i), of any deep unit L, is the summation of the outputs of all preceding residual functions (plus x_0). This is in contrast to a “plain network”, where a feature x_L is a series of matrix-vector products, say, \prod_{i=0}^{L-1} W_i x_0 (ignoring BN and ReLU).
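To make this contrast concrete, the toy comparison below composes the same small random weight matrices once additively (as in Eqn.(4), with a linear F for simplicity) and once as a pure chain of matrix-vector products \prod_{i=0}^{L-1} W_i x_0; the sizes and weight scales are arbitrary illustrative choices:

```python
import torch

torch.manual_seed(0)
d, L = 8, 5
weights = [0.01 * torch.randn(d, d) for _ in range(L)]  # small weights, for illustration
x0 = torch.randn(d)

x_res, x_plain = x0.clone(), x0.clone()
for W in weights:
    x_res = x_res + x_res @ W            # residual form: x_{i+1} = x_i + F(x_i, W_i)
    x_plain = x_plain @ W                # plain form: x_{i+1} = W_i x_i (no shortcut)

# With small weights the plain chain collapses toward zero, while the residual
# form keeps the x_0 term: the feature is a sum anchored at x_0, not a product.
print(x0.norm(), x_res.norm(), x_plain.norm())
```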
Eqn.(4) also leads to nice backward propagation properties. Denoting the
loss function as E, from the chain rule of backpropagation [9] we have:
\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \frac{\partial x_L}{\partial x_l} = \frac{\partial E}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i) \right).    (5)
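The decomposition in Eqn.(5) can also be checked with automatic differentiation: back-propagating the loss all the way to x_l must equal ∂E/∂x_L plus ∂E/∂x_L propagated through the accumulated residual sum. A minimal PyTorch sketch, again with toy residual functions and a simple quadratic loss (both our own assumptions, used only to make the check concrete):

```python
import torch

torch.manual_seed(0)
d, L = 8, 5
weights = [0.1 * torch.randn(d, d) for _ in range(L)]

x_l = torch.randn(d, requires_grad=True)

# Unroll Eqn.(3) (f = identity), keeping the accumulated sum S = sum_i F(x_i, W_i).
x_i, S = x_l, torch.zeros(d)
for W in weights:
    f_i = torch.tanh(x_i @ W)            # F(x_i, W_i)
    S = S + f_i
    x_i = x_i + f_i
x_L = x_i                                # by Eqn.(4), x_L = x_l + S

E = 0.5 * (x_L ** 2).sum()               # toy loss, chosen so that dE/dx_L = x_L

# Direct backpropagation of E through the whole unrolled chain.
grad_direct = torch.autograd.grad(E, x_l, retain_graph=True)[0]

# Eqn.(5): dE/dx_l = dE/dx_L plus dE/dx_L propagated through d/dx_l of sum_i F.
g = x_L.detach()                                            # dE/dx_L for this loss
through_residuals = torch.autograd.grad(S, x_l, grad_outputs=g)[0]
assert torch.allclose(grad_direct, g + through_residuals, atol=1e-5)
```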
Eqn.(5) indicates that the gradient ∂E/∂x_l can be decomposed into two additive terms: a term of ∂E/∂x_L that propagates information directly without involving any weight layers, and another term of (∂E/∂x_L)(∂/∂x_l ∑_{i=l}^{L-1} F) that propagates through the weight layers. The additive term of ∂E/∂x_L ensures that information is directly propagated back to any shallower unit l. Eqn.(5) also suggests that it
¹ It is noteworthy that there are Residual Units for increasing dimensions and reducing feature map sizes [1] in which h is not identity. In this case the following derivations do not hold strictly. But as there are only very few such units (two on CIFAR and three on ImageNet, depending on image sizes [1]), we expect that they do not have the exponential impact that we present in Sec. 3. One may also think of our derivations as applied to all Residual Units within the same feature map size.