It can be seen from (5) that $g_{\mathcal{B}_t}$ and $\Psi_{\mathcal{B}_t}$ are also batch statistics involved in BN during BP, but they have never been well discussed before.
3.2 INSTABILITY OF BATCH STATISTICS
According to Ioffe & Szegedy (2015), the ideal normalization is to normalize feature maps X using
expectation and variance computed over the whole training data set:
$$Y = \frac{X - E[X]}{\sqrt{Var[X]}}. \qquad (6)$$
But this is impractical when using stochastic optimization. Therefore, Ioffe & Szegedy (2015) use mini-batches in stochastic gradient training, where each mini-batch produces estimates of the mean and variance of each activation. This simplification makes it possible to involve the mean and variance in BP.
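For concreteness, below is a minimal NumPy sketch of this per-mini-batch simplification; the function name, tensor layout (batch, features), and epsilon are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def batch_norm_forward(X, eps=1e-5):
    """Normalize each activation with its mini-batch mean and variance.

    X has shape (batch_size, num_features); eps guards against division by zero.
    Returns the normalized output Y along with the batch statistics.
    """
    mu_B = X.mean(axis=0)                  # mini-batch estimate of E[X]
    var_B = X.var(axis=0)                  # mini-batch estimate of Var[X]
    Y = (X - mu_B) / np.sqrt(var_B + eps)  # mini-batch surrogate for Eq. (6)
    return Y, mu_B, var_B

# Usage: a small batch (|B| = 2) yields noisy estimates of the population statistics.
X = np.random.randn(2, 8)
Y, mu_B, var_B = batch_norm_forward(X)
```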
From the derivation in Section 3.1, we can see that the batch statistics $\mu_{\mathcal{B}_t}$ and $\sigma^2_{\mathcal{B}_t}$ are the Monte Carlo (MC) estimators of the population statistics $E[X|\Theta_t]$ and $Var[X|\Theta_t]$ respectively at iteration $t$. Similarly, the batch statistics $g_{\mathcal{B}_t}$ and $\Psi_{\mathcal{B}_t}$ are MC estimators of the population statistics $E[\frac{\partial L}{\partial Y_{b,:}}|\Theta_t]$ and $E[Y_{b,:} \cdot \frac{\partial L}{\partial Y_{b,:}}|\Theta_t]$ at iteration $t$, where these expectations are computed over the whole data set. These population statistics carry the information about how the mean and variance of the population will change as the model updates, so they play an important role in trading off the change of an individual sample against the change of the population. Therefore, it is crucial to estimate the population statistics precisely in order to regularize the gradients of the model properly as the weights update.
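To make the role of these backward batch statistics concrete, the following is a minimal NumPy sketch of a simplified BN backward pass (without affine parameters), in which $g_{\mathcal{B}}$ and $\Psi_{\mathcal{B}}$ appear as mini-batch averages of the upstream gradient; the function name, shapes, and epsilon are assumptions for illustration:

```python
import numpy as np

def batch_norm_backward(dY, Y, var_B, eps=1e-5):
    """Simplified BN backward pass (no affine parameters) exposing g_B and Psi_B.

    dY is the upstream gradient dL/dY and Y is the normalized forward output,
    both of shape (batch_size, num_features); var_B is the forward batch variance.
    """
    g_B = dY.mean(axis=0)          # MC estimate of E[dL/dY | Theta_t]
    psi_B = (Y * dY).mean(axis=0)  # MC estimate of E[Y * dL/dY | Theta_t]
    dX = (dY - g_B - Y * psi_B) / np.sqrt(var_B + eps)
    return dX, g_B, psi_B
```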
It is well known that the variance of an MC estimator is inversely proportional to the number of samples, hence the variance of batch statistics increases dramatically when the batch size is small. Figure 2 shows the change of batch statistics from a specific normalization layer of ResNet-50 during training on ImageNet. Regular batch statistics (orange line) are regarded as a good approximation of the population statistics. We can see that small batch statistics (blue line) are highly unstable and contain notable errors compared with regular batch statistics during training. In fact, the bias of $g_{\mathcal{B}_t}$ and $\Psi_{\mathcal{B}_t}$ in BP is more serious than that of $\mu_{\mathcal{B}_t}$ and $\sigma^2_{\mathcal{B}_t}$ (see Figure 2(c), 2(d)). The instability of small batch statistics can worsen the capacity of the model in two aspects: firstly, it makes training unstable, resulting in slow convergence; secondly, it can produce a huge difference between batch statistics and population statistics. Since the model is trained using batch statistics but evaluated using population statistics, this difference causes an inconsistency between the training and inference procedures, leading to poor performance of the model on evaluation data.
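As a quick sanity check on the claim that an MC estimator's variance scales inversely with the number of samples, the following small simulation (synthetic Gaussian data, illustrative numbers only) compares the variance of the mini-batch mean at different batch sizes:

```python
import numpy as np

# Synthetic check: the variance of the mini-batch mean, an MC estimator of the
# population mean, shrinks roughly as 1 / batch_size.
rng = np.random.default_rng(0)
population = rng.normal(size=1_000_000)

for batch_size in (2, 32, 256):
    means = [rng.choice(population, size=batch_size).mean() for _ in range(2000)]
    print(batch_size, float(np.var(means)))  # ~ population variance / batch_size
```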
Figure 2: Plot of batch statistics from layer1.0.bn1 in ResNet-50 during training. Panels: (a) $\mu_{\mathcal{B}}$, (b) $\sigma^2_{\mathcal{B}}$, (c) $g_{\mathcal{B}}$, (d) $\Psi_{\mathcal{B}}$. The formulations of these batch statistics ($\mu_{\mathcal{B}}$, $\sigma^2_{\mathcal{B}}$, $g_{\mathcal{B}}$, $\Psi_{\mathcal{B}}$) have been given in Section 3.1. The blue line represents the small batch statistics ($|\mathcal{B}| = 2$), while the orange line represents the regular batch statistics ($|\mathcal{B}| = 32$). The x-axis represents the iterations, and the y-axis represents the $\ell_2$ norm of these statistics in each figure. Note that the mean of $g$ and $\Psi$ is close to zero, hence the $\ell_2$ norms of $g_{\mathcal{B}}$ and $\Psi_{\mathcal{B}}$ essentially represent their standard deviations.
4 MOVING AVERAGE BATCH NORMALIZATION
Based on the discussion in Section 3.2, the key to restoring the performance of BN is to resolve the instability of small batch statistics. Therefore, we consider two ways to handle the instability of