WHY DOES UNSUPERVISED PRE-TRAINING HELP DEEP LEARNING?
and Niyogi, 2002; Chapelle et al., 2003). A long-standing variant of this approach is the applica-
tion of Principal Components Analysis as a pre-processing step before applying a classifier (on the
projected data). In these models the data is first transformed into a new representation using
unsupervised learning, and a supervised classifier is stacked on top, learning to map this new
representation to class predictions.
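As a concrete illustration of this two-stage pipeline, the following minimal sketch uses synthetic two-class data, PCA computed via the SVD, and a nearest-centroid classifier standing in for the supervised learner; all data and design choices here are illustrative, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic classes in 10 dimensions, separated by a mean shift
X0 = rng.normal(loc=0.0, size=(50, 10))
X1 = rng.normal(loc=2.0, size=(50, 10))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

# Unsupervised step: PCA via SVD of the centered data
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
Z = (X - mu) @ Vt[:2].T          # project onto the top-2 principal components

# Supervised step: a nearest-centroid classifier on the projected data
centroids = np.stack([Z[y == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((Z[:, None, :] - centroids) ** 2).sum(-1), axis=1)
accuracy = (pred == y).mean()
```

Because the between-class direction carries most of the variance here, the top principal components preserve the class structure and the simple classifier separates the projected data almost perfectly.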
Instead of having separate unsupervised and supervised components in the model, one can con-
sider models in which P(X) (or P(X,Y)) and P(Y|X) share parameters (or whose parameters are
connected in some way), and one can trade off the supervised criterion −logP(Y|X) with the
unsupervised or generative one (−logP(X) or −logP(X,Y)). It can then be seen that the generative
criterion corresponds to a particular form of prior (Lasserre et al., 2006), namely that the structure of
P(X) is connected to the structure of P(Y|X) in a way that is captured by the shared parametrization.
By controlling how much of the generative criterion is included in the total criterion, one can find a
better trade-off than with a purely generative or a purely discriminative training criterion (Lasserre
et al., 2006; Larochelle and Bengio, 2008).
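The trade-off described above can be written as a single hybrid training criterion. In the following sketch, λ ≥ 0 is a weighting hyperparameter (our notation, not the paper's) and θ denotes the shared parameters:

    C(θ) = −log P(Y|X; θ) − λ log P(X; θ)

Setting λ = 0 recovers purely discriminative training, while letting the generative term dominate approaches purely generative training; intermediate values correspond to the form of prior discussed by Lasserre et al. (2006).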
In the context of deep architectures, a very interesting application of these ideas involves adding
an unsupervised embedding criterion at each layer (or only one intermediate layer) to a traditional
supervised criterion (Weston et al., 2008). This has been shown to be a powerful semi-supervised
learning strategy, and is an alternative to the kind of algorithms described and evaluated in this
paper, which also combine unsupervised learning with supervised learning.
In the context of scarcity of labelled data (and abundance of unlabelled data), deep architectures
have shown promise as well. Salakhutdinov and Hinton (2008) describe a method for learning the
covariance matrix of a Gaussian Process, in which the use of unlabelled examples for modeling
P(X) improves P(Y|X) quite significantly. Note that such a result is to be expected: with few
labelled samples, modeling P(X) usually helps. Our results show that even in the context of abundant
labelled data, unsupervised pre-training still has a pronounced positive effect on generalization: a
somewhat surprising conclusion.
4.2 Early Stopping as a Form of Regularization
We stated that pre-training as initialization can be seen as restricting the optimization procedure to
a relatively small volume of parameter space that corresponds to a local basin of attraction of the
supervised cost function. Early stopping can be seen as having a similar effect, by constraining the
optimization procedure to a region of the parameter space that is close to the initial configuration
of parameters. With τ the number of training iterations and η the learning rate used in the update
procedure, the product τη can be seen as the reciprocal of a regularization parameter: restricting
either quantity restricts the region of parameter space reachable from the starting point. In the case
of the optimization of a simple linear model (initialized at the origin) using a quadratic error function
and simple gradient descent, early stopping has an effect similar to traditional ℓ2 regularization.
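The linear-model case can be checked numerically. The sketch below (synthetic data; step counts, learning rate, and the regression setup are our illustrative choices) runs gradient descent on a quadratic error from the origin and compares the early-stopped weights to the ridge solution with regularization strength 1/(τη):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def early_stopped_gd(tau, eta):
    """tau steps of gradient descent on (1/2n)||Xw - y||^2, starting at the origin."""
    w = np.zeros(d)
    for _ in range(tau):
        w -= eta * X.T @ (X @ w - y) / n
    return w

eta = 0.01
w_short = early_stopped_gd(10, eta)       # stopped early
w_long = early_stopped_gd(50000, eta)     # run (almost) to convergence
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge solution with regularization strength 1/(tau * eta)
lam = 1.0 / (10 * eta)
w_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
```

Stopping early keeps the weights closer to the origin than running to convergence, and the early-stopped iterate lies closer to the ridge solution with λ = 1/(τη) than to the unregularized least-squares solution, in line with the correspondence described above.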
Thus, in both pre-training and early stopping, the parameters of the supervised cost function
are constrained to be close to their initial values.³ A more formal treatment of early stopping as
regularization is given by Sjöberg and Ljung (1995) and Amari et al. (1997). There is no equivalent
treatment of pre-training, but this paper sheds some light on the effects of such initialization in the
case of deep architectures.
3. In the case of pre-training, the “initial values” of the parameters for the supervised phase are those that were
obtained at the end of pre-training.