It can be seen from (5) that $g_{\mathcal{B}_t}$ and $\Psi_{\mathcal{B}_t}$ are also batch statistics involved in BN during BP, but they have never been well discussed before.
3.2 INSTABILITY OF BATCH STATISTICS
According to Ioffe & Szegedy (2015), the ideal normalization is to normalize feature maps X using
expectation and variance computed over the whole training data set:
$$Y = \frac{X - E[X]}{\sqrt{Var[X]}}. \qquad (6)$$
But this is impractical when using stochastic optimization. Therefore, Ioffe & Szegedy (2015) use mini-batches in stochastic gradient training, where each mini-batch produces estimates of the mean and variance of each activation. This simplification makes it possible to involve the mean and variance in BP.
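For concreteness, below is a minimal NumPy sketch of this per-mini-batch simplification; the function name, tensor layout (batch, features), and epsilon are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def batch_norm_forward(X, eps=1e-5):
    """Normalize each activation with its mini-batch mean and variance.

    X has shape (batch_size, num_features); eps guards against division by zero.
    Returns the normalized output Y along with the batch statistics.
    """
    mu_B = X.mean(axis=0)                  # mini-batch estimate of E[X]
    var_B = X.var(axis=0)                  # mini-batch estimate of Var[X]
    Y = (X - mu_B) / np.sqrt(var_B + eps)  # mini-batch surrogate for Eq. (6)
    return Y, mu_B, var_B

# Usage: a small batch (|B| = 2) yields noisy estimates of the population statistics.
X = np.random.randn(2, 8)
Y, mu_B, var_B = batch_norm_forward(X)
```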
From the derivation in Section 3.1, we can see that the batch statistics $\mu_{\mathcal{B}_t}$ and $\sigma^2_{\mathcal{B}_t}$ are the Monte Carlo (MC) estimators of the population statistics $E[X|\Theta_t]$ and $Var[X|\Theta_t]$ respectively at iteration $t$. Similarly, the batch statistics $g_{\mathcal{B}_t}$ and $\Psi_{\mathcal{B}_t}$ are MC estimators of the population statistics $E[\frac{\partial L}{\partial Y_{b,:}}|\Theta_t]$ and $E[Y_{b,:} \cdot \frac{\partial L}{\partial Y_{b,:}}|\Theta_t]$ at iteration $t$, where these expectations are computed over the whole data set. These population statistics carry the information about how the mean and variance of the population will change as the model updates, so they play an important role in trading off the change of an individual sample against the change of the population. Therefore, it is crucial to estimate the population statistics precisely in order to regularize the gradients of the model properly as the weights update.
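To make the role of these backward batch statistics concrete, the following is a minimal NumPy sketch of a simplified BN backward pass (without affine parameters), in which $g_{\mathcal{B}}$ and $\Psi_{\mathcal{B}}$ appear as mini-batch averages of the upstream gradient; the function name, shapes, and epsilon are assumptions for illustration:

```python
import numpy as np

def batch_norm_backward(dY, Y, var_B, eps=1e-5):
    """Simplified BN backward pass (no affine parameters) exposing g_B and Psi_B.

    dY is the upstream gradient dL/dY and Y is the normalized forward output,
    both of shape (batch_size, num_features); var_B is the forward batch variance.
    """
    g_B = dY.mean(axis=0)          # MC estimate of E[dL/dY | Theta_t]
    psi_B = (Y * dY).mean(axis=0)  # MC estimate of E[Y * dL/dY | Theta_t]
    dX = (dY - g_B - Y * psi_B) / np.sqrt(var_B + eps)
    return dX, g_B, psi_B
```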
It is well known that the variance of an MC estimator is inversely proportional to the number of samples, hence the variance of batch statistics increases dramatically when the batch size is small. Figure 2 shows the change of batch statistics from a specific normalization layer of ResNet-50 during training on ImageNet. Regular batch statistics (orange line) are regarded as a good approximation of the population statistics. We can see that small batch statistics (blue line) are highly unstable and contain notable errors compared with regular batch statistics during training. In fact, the bias of $g_{\mathcal{B}_t}$ and $\Psi_{\mathcal{B}_t}$ in BP is more serious than that of $\mu_{\mathcal{B}_t}$ and $\sigma^2_{\mathcal{B}_t}$ (see Figure 2(c), 2(d)). The instability of small batch statistics can worsen the capacity of the model in two aspects: firstly, it makes training unstable, resulting in slow convergence; secondly, it can produce a huge difference between batch statistics and population statistics. Since the model is trained using batch statistics but evaluated using population statistics, this difference causes an inconsistency between the training and inference procedures, leading to poor performance of the model on evaluation data.
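As a quick sanity check on the claim that an MC estimator's variance scales inversely with the number of samples, the following small simulation (synthetic Gaussian data, illustrative numbers only) compares the variance of the mini-batch mean at different batch sizes:

```python
import numpy as np

# Synthetic check: the variance of the mini-batch mean, an MC estimator of the
# population mean, shrinks roughly as 1 / batch_size.
rng = np.random.default_rng(0)
population = rng.normal(size=1_000_000)

for batch_size in (2, 32, 256):
    means = [rng.choice(population, size=batch_size).mean() for _ in range(2000)]
    print(batch_size, float(np.var(means)))  # ~ population variance / batch_size
```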
Figure 2: Plot of batch statistics from layer1.0.bn1 in ResNet-50 during training. Panels: (a) $\mu_{\mathcal{B}}$, (b) $\sigma^2_{\mathcal{B}}$, (c) $g_{\mathcal{B}}$, (d) $\Psi_{\mathcal{B}}$. The formulations of these batch statistics ($\mu_{\mathcal{B}}$, $\sigma^2_{\mathcal{B}}$, $g_{\mathcal{B}}$, $\Psi_{\mathcal{B}}$) have been given in Section 3.1. The blue line represents the small batch statistics ($|\mathcal{B}| = 2$), while the orange line represents the regular batch statistics ($|\mathcal{B}| = 32$). The x-axis represents the iterations, and the y-axis represents the $\ell_2$ norm of these statistics in each figure. Note that the mean of $g$ and $\Psi$ is close to zero, hence the $\ell_2$ norms of $g_{\mathcal{B}}$ and $\Psi_{\mathcal{B}}$ essentially represent their standard deviations.
4 MOVING AVERAGE BATCH NORMALIZATION
Based on the discussion in Section 3.2, the key to restoring the performance of BN is to resolve the instability of small batch statistics. Therefore, we consider two ways to handle the instability of