
thus seem to indicate that the main power of deep residual networks lies in the residual blocks, and that the effect of depth is supplementary. We note that one can train even better wide residual networks that have twice as many parameters (and more), which suggests that, to obtain further improvements by increasing the depth of thin networks, one would need to add thousands of layers.
Use of dropout in ResNet blocks. Dropout was first introduced in [27] and was then adopted by many successful architectures such as [16, 26]. It was mostly applied to top layers that had a large number of parameters, in order to prevent feature co-adaptation and overfitting. It was later largely superseded by batch normalization [15], which was introduced as a technique for reducing internal covariate shift in neural network activations by normalizing them to have a specific distribution. Batch normalization also acts as a regularizer, and its authors showed experimentally that a network with batch normalization achieves better accuracy than a network with dropout. In our case, as widening residual blocks increases the number of parameters, we study the effect of dropout for regularizing training and preventing overfitting. Dropout in residual networks was previously studied in [13], where it was inserted in the identity part of the block, and the authors showed that this has negative effects. Instead, we argue here that dropout should be inserted between the convolutional layers. Experimental results on wide residual networks show that this leads to consistent gains, even yielding new state-of-the-art results (e.g., a 16-layer-deep wide residual network with dropout achieves 1.64% error on SVHN).
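For illustration, a minimal sketch of such a block is given below (assuming PyTorch and the pre-activation BN-ReLU-conv ordering; the class name WideBasicDropout, the default drop_rate, and the 1×1 projection shortcut are illustrative choices rather than a prescribed implementation). Dropout sits between the two 3×3 convolutions, while the identity path is left untouched.

    import torch
    import torch.nn as nn

    class WideBasicDropout(nn.Module):
        """Sketch: pre-activation residual block with dropout between the convolutions."""

        def __init__(self, in_planes, out_planes, stride=1, drop_rate=0.3):
            super().__init__()
            self.bn1 = nn.BatchNorm2d(in_planes)
            self.conv1 = nn.Conv2d(in_planes, out_planes, kernel_size=3,
                                   stride=stride, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(out_planes)
            self.dropout = nn.Dropout(p=drop_rate)
            self.conv2 = nn.Conv2d(out_planes, out_planes, kernel_size=3,
                                   stride=1, padding=1, bias=False)
            # 1x1 projection on the shortcut when the shape changes, identity otherwise
            self.shortcut = (nn.Identity()
                             if stride == 1 and in_planes == out_planes
                             else nn.Conv2d(in_planes, out_planes, kernel_size=1,
                                            stride=stride, bias=False))

        def forward(self, x):
            out = self.conv1(torch.relu(self.bn1(x)))
            out = self.dropout(torch.relu(self.bn2(out)))  # dropout between the convolutions
            out = self.conv2(out)
            return out + self.shortcut(x)  # dropout never touches the identity path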
In summary, the contributions of this work are as follows:
• We present a detailed experimental study of residual network architectures that thor-
oughly examines several important aspects of ResNet block structure.
• We propose a novel widened architecture for ResNet blocks that allows for residual
networks with significantly improved performance.
• We propose a new way of utilizing dropout within deep residual networks so as to
properly regularize them and prevent overfitting during training.
• Finally, we show that our proposed ResNet architectures achieve state-of-the-art results on several datasets, dramatically improving the accuracy and speed of residual networks.
2 Wide residual networks
A residual block with identity mapping can be represented by the following formula:
x_{l+1} = x_l + F(x_l, W_l)    (1)
where x_l and x_{l+1} are the input and output of the l-th unit in the network, F is a residual function, and W_l are the parameters of the block. A residual network consists of sequentially stacked residual blocks.
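As a minimal illustration of Eq. (1), the following sketch (assuming PyTorch, and blocks whose output shape matches their input so that the identity shortcut applies directly) stacks residual blocks sequentially; ResidualStack is an illustrative name, not part of any released code.

    import torch.nn as nn

    class ResidualStack(nn.Module):
        """Sketch of Eq. (1): x_{l+1} = x_l + F(x_l, W_l), applied block after block."""

        def __init__(self, blocks):
            super().__init__()
            self.blocks = nn.ModuleList(blocks)  # each element implements a residual function F

        def forward(self, x):
            for residual_fn in self.blocks:
                x = x + residual_fn(x)  # identity shortcut plus residual branch
            return x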
In [13], residual networks consisted of two types of blocks (both sketched in code below):
• basic - two consecutive 3×3 convolutions, with batch normalization and ReLU preceding each convolution: conv3×3-conv3×3, Fig. 1(a)
• bottleneck - one 3×3 convolution surrounded by dimensionality-reducing and dimensionality-expanding 1×1 convolution layers: conv1×1-conv3×3-conv1×1, Fig. 1(b)
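For concreteness, the two residual functions F can be sketched as follows (assuming PyTorch; the BN-ReLU-conv ordering follows the description above, and the channel-reduction factor of 4 in the bottleneck is an illustrative choice, not specified in the text). Either function can be plugged into a residual stack such as the one sketched after Eq. (1).

    import torch.nn as nn

    def basic_block(planes):
        """conv3x3-conv3x3 residual function (Fig. 1(a)), BN and ReLU preceding each convolution."""
        return nn.Sequential(
            nn.BatchNorm2d(planes), nn.ReLU(inplace=True),
            nn.Conv2d(planes, planes, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(planes), nn.ReLU(inplace=True),
            nn.Conv2d(planes, planes, kernel_size=3, padding=1, bias=False),
        )

    def bottleneck_block(planes, reduction=4):
        """conv1x1-conv3x3-conv1x1 residual function (Fig. 1(b)); the 1x1 layers
        reduce and then restore the channel dimension."""
        inner = planes // reduction
        return nn.Sequential(
            nn.BatchNorm2d(planes), nn.ReLU(inplace=True),
            nn.Conv2d(planes, inner, kernel_size=1, bias=False),
            nn.BatchNorm2d(inner), nn.ReLU(inplace=True),
            nn.Conv2d(inner, inner, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(inner), nn.ReLU(inplace=True),
            nn.Conv2d(inner, planes, kernel_size=1, bias=False),
        )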