BatchNorm in Deep Learning: The Mystery of Faster Optimization


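"BatchNorm有效性原理探索.pdf" This paper examines how Batch Normalization (BatchNorm) works in the training of deep neural networks (DNNs). BatchNorm is a widely used technique that speeds up and stabilizes DNN training. Despite its ubiquity, the exact reasons for its effectiveness remain poorly understood. The conventional view holds that BatchNorm works by controlling how the distribution of each layer's inputs changes during training, thereby reducing the so-called "internal covariate shift".

The authors challenge this view. They find that the stability of layer input distributions is not the key factor behind BatchNorm's success. Instead, they identify a more fundamental effect on training: BatchNorm makes the optimization landscape significantly smoother. As a result, gradients behave more predictably and stably, which allows faster training.

Optimization is at the core of training deep learning models. Complex architectures and large parameter counts commonly cause difficulties such as vanishing or exploding gradients and poor local optima. BatchNorm normalizes each layer's inputs using the mean and variance of the current mini-batch, which keeps them in a relatively stable range, but this is not its main benefit. What matters is that BatchNorm reshapes the loss landscape so that gradient descent proceeds more easily, with less fluctuation during training and faster convergence.

A smoother landscape also means the loss is less sensitive to parameter updates, which helps avoid instability during training. In practice, models can be trained with larger learning rates, which further improves training efficiency. More stable gradient behavior also helps models reach consistent performance across different initializations, making training more robust.

This work deepens the understanding of BatchNorm by showing that its core role is to improve the optimization process rather than simply to stabilize input distributions. The finding matters for both optimization theory and deep learning practice, and it can help researchers and engineers design and tune models for better training efficiency and performance.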
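The per-batch normalization mentioned in the summary can be illustrated with a minimal sketch in plain Python. It handles a single feature over one mini-batch; the function name `batch_norm`, the parameters `gamma`, `beta`, and `eps`, and the sample activations are assumptions chosen for illustration and are not taken from the paper.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it.

    gamma, beta, and eps are illustrative defaults (an assumption).
    """
    n = len(batch)
    mean = sum(batch) / n
    # Biased (population) variance of the mini-batch.
    var = sum((x - mean) ** 2 for x in batch) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [gamma * (x - mean) * inv_std + beta for x in batch]

# Hypothetical activations of one unit across a mini-batch of four examples.
activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
```

With `gamma=1.0` and `beta=0.0`, the output has mean close to 0 and variance close to 1; in a real layer the scale and shift would be learned parameters.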

[Figure 3 plots. Panels: (a) VGG, (b) Deep Linear Network. Curves: Standard and Standard + BatchNorm. Panel (a) shows Training Accuracy (%) versus Steps (0 to 15k) at LR = 0.1 and LR = 0.01, with Cos Angle and $\ell_2$-difference versus Steps for Layer #5 and Layer #10. Panel (b) shows Training Loss versus Steps (0 to 10k) at LR = 1e-06 and LR = 1e-07, with Cos Angle and $\ell_2$-difference versus Steps for Layer #9 and Layer #17.]

Figure 3: Measurement of internal covariate shift in networks with and without BatchNorm layers. For a layer we measure the cosine angle (ideally 1) and $\ell_2$-difference of the gradients (ideally 0) before and after updates to the preceding layers (see Definition 2.2). Models with BatchNorm have similar, or even worse, internal covariate shift, despite performing better in terms of accuracy and loss. (Stabilization of BatchNorm faster during training is an artifact of parameter convergence.)
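The two quantities named in the caption can be sketched as follows. Definition 2.2 is not reproduced in this excerpt, so this is only an illustration of a cosine angle and an L2 difference between two gradient vectors; the function names and the gradient values are hypothetical.

```python
import math

def cosine_angle(g, g_prime):
    """Cosine of the angle between two gradient vectors (1 means same direction)."""
    dot = sum(a * b for a, b in zip(g, g_prime))
    norm_g = math.sqrt(sum(a * a for a in g))
    norm_gp = math.sqrt(sum(b * b for b in g_prime))
    return dot / (norm_g * norm_gp)

def l2_difference(g, g_prime):
    """Euclidean distance between two gradient vectors (0 means identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g, g_prime)))

# Hypothetical gradients of one layer before and after the preceding
# layers have been updated.
grad_before = [1.0, 0.0]
grad_after = [1.0, 1.0]

cos = cosine_angle(grad_before, grad_after)
diff = l2_difference(grad_before, grad_after)
```

If updating the preceding layers left a layer's gradient unchanged, `cos` would be 1 and `diff` would be 0, which matches the ideal values stated in the caption.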

3 Why does BatchNorm work?

Our investigation so far demonstrated that the generally asserted link between the internal covariate shift (ICS) and the optimization performance is tenuous, at best. But BatchNorm does significantly improve the training process. Can we explain why this is the case?

Aside from reducing ICS, Ioffe and Szegedy [ ] identify a number of additional properties of BatchNorm. These include prevention of exploding or vanishing gradients, robustness to different settings of hyperparameters such as learning rate and initialization scheme, and keeping most of the activations away from saturation regions of non-linearities. All these properties are clearly beneficial to the training process. But they are fairly simple consequences of the mechanics of BatchNorm and do little to uncover the underlying factors responsible for BatchNorm's success. Is there a more fundamental phenomenon at play here?

3.1 The smoothing effect of BatchNorm

Indeed, we identify the key impact that BatchNorm has on the training process: it reparametrizes the underlying optimization problem to make its landscape be significantly more smooth. The first manifestation of this impact is improvement in the Lipschitzness of the loss function. That is, the loss changes at a smaller rate and the magnitudes of the gradients are smaller too. There is, however,

Recall that a function $f$ is $L$-Lipschitz if $|f(x_1) - f(x_2)| \le L\|x_1 - x_2\|$, for all $x_1$ and $x_2$.
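The Lipschitz condition recalled above can be probed numerically. The sketch below samples random pairs of points and records the largest observed ratio $|f(x_1) - f(x_2)| / |x_1 - x_2|$, which is a lower bound on the true Lipschitz constant over the sampled interval. The helper name `empirical_lipschitz`, the sampling interval, and the choice of `math.sin` as a test function are assumptions for illustration; this is not the procedure used in the paper.

```python
import math
import random

def empirical_lipschitz(f, low, high, samples=10_000, seed=0):
    """Largest observed |f(x1) - f(x2)| / |x1 - x2| over random pairs.

    This only lower-bounds the true Lipschitz constant on [low, high].
    """
    rng = random.Random(seed)
    best = 0.0
    for _ in range(samples):
        x1 = rng.uniform(low, high)
        x2 = rng.uniform(low, high)
        if x1 != x2:
            best = max(best, abs(f(x1) - f(x2)) / abs(x1 - x2))
    return best

# sin is 1-Lipschitz, so the estimate should approach but not exceed 1.
estimate = empirical_lipschitz(math.sin, -10.0, 10.0)
```

A smaller constant means the function changes more slowly between nearby points, which is the sense in which the text speaks of improved Lipschitzness of the loss.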
