Understanding the difficulty of training deep feedforward neural networks
Xavier Glorot, Yoshua Bengio
DIRO, Université de Montréal, Montréal, Québec, Canada
Abstract
Whereas before 2006 it appears that deep multi-
layer neural networks were not successfully
trained, since then several algorithms have been
shown to successfully train them, with experi-
mental results showing the superiority of deeper
over less deep architectures. All these experimental
results were obtained with new initialization
or training mechanisms. Our objective here is to
understand better why standard gradient descent
from random initialization is doing so poorly
with deep neural networks, to better understand
these recent relative successes and help design
better algorithms in the future. We first observe
the influence of the non-linear activation functions.
We find that the logistic sigmoid activation
is unsuited for deep networks with random initialization
because of its non-zero mean value, which can
drive the top hidden layer, in particular, into
saturation. Surprisingly, we find that saturated units
can move out of saturation by themselves, albeit
slowly, which explains the plateaus sometimes
seen when training neural networks. We find that
a new non-linearity that saturates less can often
be beneficial. Finally, we study how activations
and gradients vary across layers and during train-
ing, with the idea that training may be more dif-
ficult when the singular values of the Jacobian
associated with each layer are far from 1. Based
on these considerations, we propose a new ini-
tialization scheme that brings substantially faster
convergence.
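As a concrete preview of the initialization scheme the abstract announces (the derivation comes later in the paper), the "normalized" initialization — now widely known as Glorot or Xavier initialization — samples each weight uniformly with limit sqrt(6 / (fan_in + fan_out)). The NumPy sketch below is our own illustration, not the authors' code; the name `glorot_uniform` is a label of convenience:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, seed=None):
    """Sample a (fan_in, fan_out) weight matrix from U[-limit, limit]
    with limit = sqrt(6 / (fan_in + fan_out)), chosen so that the
    variance of activations and of back-propagated gradients is
    roughly preserved from layer to layer."""
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = glorot_uniform(784, 256, seed=0)
# Variance of U[-a, a] is a**2 / 3, i.e. 2 / (fan_in + fan_out) here
print(W.shape, W.var())
```

Note the symmetry of the limit in `fan_in` and `fan_out`: it compromises between keeping the forward signal variance stable (which alone would suggest a `1/fan_in` scaling) and keeping the backward gradient variance stable (`1/fan_out`).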
1 Deep Neural Networks
Deep learning methods aim at learning feature hierarchies
with features from higher levels of the hierarchy formed
by the composition of lower level features. They include
Appearing in Proceedings of the 13th International Conference
on Artificial Intelligence and Statistics (AISTATS) 2010, Chia
Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9.
Copyright 2010 by the authors.
learning methods for a wide array of deep architectures,
including neural networks with many hidden layers (Vin-
cent et al., 2008) and graphical models with many levels of
hidden variables (Hinton et al., 2006), among others (Zhu
et al., 2009; Weston et al., 2008). Much attention has re-
cently been devoted to them (see (Bengio, 2009) for a re-
view), because of their theoretical appeal, inspiration from
biology and human cognition, and because of empirical
success in vision (Ranzato et al., 2007; Larochelle et al.,
2007; Vincent et al., 2008) and natural language process-
ing (NLP) (Collobert & Weston, 2008; Mnih & Hinton,
2009). Theoretical results reviewed and discussed by
Bengio (2009) suggest that in order to learn the kind of
complicated functions that can represent high-level abstractions
(e.g. in vision, language, and other AI-level tasks), one
may need deep architectures.
Most of the recent experimental results with deep architectures
are obtained with models that can be turned into
deep supervised neural networks, but with initialization or
training schemes different from the classical feedforward
neural networks (Rumelhart et al., 1986). Why are these
new algorithms working so much better than the standard
random initialization and gradient-based optimization of a
supervised training criterion? Part of the answer may be
found in recent analyses of the effect of unsupervised pre-
training (Erhan et al., 2009), showing that it acts as a regu-
larizer that initializes the parameters in a “better” basin of
attraction of the optimization procedure, corresponding to
an apparent local minimum associated with better general-
ization. But earlier work (Bengio et al., 2007) had shown
that even a purely supervised but greedy layer-wise proce-
dure would give better results. So here instead of focus-
ing on what unsupervised pre-training or semi-supervised
criteria bring to deep architectures, we focus on analyzing
what may be going wrong with good old (but deep) multi-
layer neural networks.
Our analysis is driven by investigative experiments to mon-
itor activations (watching for saturation of hidden units)
and gradients, across layers and across training iterations.
We also evaluate how these quantities are affected by the
choice of activation function (with the idea that it might
affect saturation) and of initialization procedure (since
unsupervised pre-training is a particular form of
initialization and it has a drastic impact).
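The kind of monitoring described above can be sketched in a few lines: push a batch of inputs through a randomly initialized deep tanh network and record per-layer activation statistics. This is a hypothetical minimal setup of our own, not the authors' experimental code; the `1/sqrt(fan_in)` uniform range stands in for the standard initialization heuristic of the time:

```python
import numpy as np

def monitor_activations(x, layer_sizes, act=np.tanh, seed=0):
    """Forward a batch through a randomly initialized deep network,
    recording each layer's activation mean and standard deviation —
    the kind of quantities one watches for signs of saturation or
    of activations shrinking toward zero."""
    rng = np.random.default_rng(seed)
    stats = []
    h = x
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Standard pre-2010 heuristic: W ~ U[-1/sqrt(fan_in), 1/sqrt(fan_in)]
        limit = 1.0 / np.sqrt(fan_in)
        W = rng.uniform(-limit, limit, size=(fan_in, fan_out))
        h = act(h @ W)
        stats.append((h.mean(), h.std()))
    return stats

x = np.random.default_rng(1).standard_normal((128, 100))
for i, (m, s) in enumerate(monitor_activations(x, [100] * 5), 1):
    print(f"layer {i}: mean={m:+.3f} std={s:.3f}")
```

With this old heuristic the per-layer standard deviation shrinks as depth grows (each layer multiplies the signal's scale by roughly 1/sqrt(3)), which is exactly the kind of depth-dependent pathology the experiments in this paper are designed to expose.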