Understanding the difficulty of training deep feedforward neural networks
Xavier Glorot, Yoshua Bengio
DIRO, Université de Montréal, Montréal, Québec, Canada
Abstract
Whereas before 2006 it appears that deep multi-
layer neural networks were not successfully
trained, since then several algorithms have been
shown to successfully train them, with experi-
mental results showing the superiority of deeper
over less deep architectures. All these experimental
results were obtained with new initialization
or training mechanisms. Our objective here is to
understand better why standard gradient descent
from random initialization is doing so poorly
with deep neural networks, to better understand
these recent relative successes and help design
better algorithms in the future. We first observe
the influence of the non-linear activation functions.
We find that the logistic sigmoid activation
is unsuited for deep networks with random initialization
because of its non-zero mean value, which can
drive the top hidden layer, in particular, into
saturation. Surprisingly, we find that saturated units
can move out of saturation by themselves, albeit
slowly, which explains the plateaus sometimes
seen when training neural networks. We find that
a new non-linearity that saturates less can often
be beneficial. Finally, we study how activations
and gradients vary across layers and during train-
ing, with the idea that training may be more dif-
ficult when the singular values of the Jacobian
associated with each layer are far from 1. Based
on these considerations, we propose a new ini-
tialization scheme that brings substantially faster
convergence.
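As a concrete preview of the initialization scheme the abstract announces (the derivation comes later in the paper), the "normalized" initialization — now widely known as Glorot or Xavier initialization — samples each weight uniformly with limit sqrt(6 / (fan_in + fan_out)). The NumPy sketch below is our own illustration, not the authors' code; the name `glorot_uniform` is a label of convenience:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, seed=None):
    """Sample a (fan_in, fan_out) weight matrix from U[-limit, limit]
    with limit = sqrt(6 / (fan_in + fan_out)), chosen so that the
    variance of activations and of back-propagated gradients is
    roughly preserved from layer to layer."""
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = glorot_uniform(784, 256, seed=0)
# Variance of U[-a, a] is a**2 / 3, i.e. 2 / (fan_in + fan_out) here
print(W.shape, W.var())
```

Note the symmetry of the limit in `fan_in` and `fan_out`: it compromises between keeping the forward signal variance stable (which alone would suggest a `1/fan_in` scaling) and keeping the backward gradient variance stable (`1/fan_out`).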
1 Deep Neural Networks
Deep learning methods aim at learning feature hierarchies
with features from higher levels of the hierarchy formed
by the composition of lower level features. They include
Appearing in Proceedings of the 13th International Conference
on Artificial Intelligence and Statistics (AISTATS) 2010, Chia
Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9.
Copyright 2010 by the authors.
learning methods for a wide array of deep architectures,
including neural networks with many hidden layers (Vin-
cent et al., 2008) and graphical models with many levels of
hidden variables (Hinton et al., 2006), among others (Zhu
et al., 2009; Weston et al., 2008). Much attention has re-
cently been devoted to them (see (Bengio, 2009) for a re-
view), because of their theoretical appeal, inspiration from
biology and human cognition, and because of empirical
success in vision (Ranzato et al., 2007; Larochelle et al.,
2007; Vincent et al., 2008) and natural language process-
ing (NLP) (Collobert & Weston, 2008; Mnih & Hinton,
2009). Theoretical results reviewed and discussed by
Bengio (2009) suggest that in order to learn the kind of
complicated functions that can represent high-level abstractions
(e.g. in vision, language, and other AI-level tasks), one
may need deep architectures.
Most of the recent experimental results with deep architectures
are obtained with models that can be turned into
deep supervised neural networks, but with initialization or
training schemes different from the classical feedforward
neural networks (Rumelhart et al., 1986). Why are these
new algorithms working so much better than the standard
random initialization and gradient-based optimization of a
supervised training criterion? Part of the answer may be
found in recent analyses of the effect of unsupervised pre-
training (Erhan et al., 2009), showing that it acts as a regu-
larizer that initializes the parameters in a “better” basin of
attraction of the optimization procedure, corresponding to
an apparent local minimum associated with better general-
ization. But earlier work (Bengio et al., 2007) had shown
that even a purely supervised but greedy layer-wise proce-
dure would give better results. So here instead of focus-
ing on what unsupervised pre-training or semi-supervised
criteria bring to deep architectures, we focus on analyzing
what may be going wrong with good old (but deep) multi-
layer neural networks.
Our analysis is driven by investigative experiments to mon-
itor activations (watching for saturation of hidden units)
and gradients, across layers and across training iterations.
We also evaluate how these quantities are affected by the
choice of activation function (with the idea that it might
affect saturation) and of initialization procedure (since
unsupervised pre-training is a particular form of
initialization and it has a drastic impact).
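The kind of monitoring described above can be sketched in a few lines: push a batch of inputs through a randomly initialized deep tanh network and record per-layer activation statistics. This is a hypothetical minimal setup of our own, not the authors' experimental code; the `1/sqrt(fan_in)` uniform range stands in for the standard initialization heuristic of the time:

```python
import numpy as np

def monitor_activations(x, layer_sizes, act=np.tanh, seed=0):
    """Forward a batch through a randomly initialized deep network,
    recording each layer's activation mean and standard deviation —
    the kind of quantities one watches for signs of saturation or
    of activations shrinking toward zero."""
    rng = np.random.default_rng(seed)
    stats = []
    h = x
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Standard pre-2010 heuristic: W ~ U[-1/sqrt(fan_in), 1/sqrt(fan_in)]
        limit = 1.0 / np.sqrt(fan_in)
        W = rng.uniform(-limit, limit, size=(fan_in, fan_out))
        h = act(h @ W)
        stats.append((h.mean(), h.std()))
    return stats

x = np.random.default_rng(1).standard_normal((128, 100))
for i, (m, s) in enumerate(monitor_activations(x, [100] * 5), 1):
    print(f"layer {i}: mean={m:+.3f} std={s:.3f}")
```

With this old heuristic the per-layer standard deviation shrinks as depth grows (each layer multiplies the signal's scale by roughly 1/sqrt(3)), which is exactly the kind of depth-dependent pathology the experiments in this paper are designed to expose.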