Deep Learning via Semi-Supervised Embedding
Jason Weston (∗) jasonw@nec-labs.com
Frédéric Ratle (†) frederic.ratle@gmail.com
Ronan Collobert (∗) collober@nec-labs.com
(∗) NEC Labs America, 4 Independence Way, Princeton, NJ 08540 USA
(†) IGAR, University of Lausanne, Amphipôle, 1015 Lausanne, Switzerland
Abstract
We show how nonlinear embedding algorithms popular for use with shallow semi-supervised learning techniques such as kernel methods can be applied to deep multi-layer architectures, either as a regularizer at the output layer, or on each layer of the architecture. This provides a simple alternative to existing approaches to deep learning whilst yielding competitive error rates compared to those methods, and existing shallow semi-supervised techniques.
1. Introduction
Embedding data into a lower-dimensional space and the related task of clustering are unsupervised dimensionality reduction techniques that have been intensively studied. Most such algorithms are developed with the motivation of producing a useful analysis and visualization tool.
Recently, the field of semi-supervised learning (Chapelle et al., 2006), which has the goal of improving generalization on supervised tasks using unlabeled data, has made use of many of these techniques. For example, researchers have used nonlinear embedding or cluster representations as features for a supervised classifier, with improved results.
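As a concrete illustration of this two-step recipe, the sketch below first learns a nonlinear embedding on all points (labeled and unlabeled) and then trains a shallow classifier on the embedded coordinates of the labeled points only. scikit-learn, the toy two-moons data, and every hyperparameter here are illustrative assumptions on our part, not the setup of the works cited:

# Disjoint shallow semi-supervised pipeline (illustrative sketch):
# step 1 learns an unsupervised embedding on all points; step 2 feeds
# its output coordinates to a standard supervised classifier.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.manifold import SpectralEmbedding
from sklearn.linear_model import LogisticRegression

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)
labeled = np.zeros(len(X), dtype=bool)
labeled[np.random.RandomState(0).choice(len(X), 20, replace=False)] = True

# Step 1 (unsupervised, uses every point): nonlinear embedding.
Z = SpectralEmbedding(n_components=2, n_neighbors=10).fit_transform(X)

# Step 2 (supervised, uses only the labeled points): shallow classifier
# trained on the embedded coordinates rather than the raw inputs.
clf = LogisticRegression().fit(Z[labeled], y[labeled])

# Transductive prediction on the remaining unlabeled points.
accuracy = clf.score(Z[~labeled], y[~labeled])
print(f"accuracy on unlabeled points: {accuracy:.2f}")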
Most of these architectures are disjoint and shallow, by which we mean the unsupervised dimensionality reduction algorithm is trained on unlabeled data separately as a first step, and then its results are fed to a supervised classifier which has a shallow architecture such as a (kernelized) linear model. For example, several methods learn a clustering or a
distance measure based on a nonlinear manifold embedding as a first step (Chapelle et al., 2003; Chapelle & Zien, 2005). Transductive Support Vector Machines (TSVMs) (Vapnik, 1998) (which employ a kind of clustering) and LapSVM (Belkin et al., 2006) (which employs a kind of embedding) are examples of methods that are joint in their use of unlabeled and labeled data, but their architecture is still shallow.
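To make the joint-but-shallow idea concrete, LapSVM augments the usual supervised objective with a graph Laplacian penalty computed over both labeled and unlabeled points. Schematically, omitting the bias term and following the notation of Belkin et al. (2006), with l labeled and u unlabeled examples:

\min_{f \in \mathcal{H}_K} \; \frac{1}{l} \sum_{i=1}^{l} \max\!\big(0,\, 1 - y_i f(x_i)\big) \;+\; \gamma_A \, \|f\|_K^2 \;+\; \frac{\gamma_I}{(l+u)^2} \, \mathbf{f}^{\top} L \, \mathbf{f}

where \mathbf{f} = (f(x_1), \ldots, f(x_{l+u}))^{\top} and L is the Laplacian of a neighborhood graph built on all l+u points. The first two terms are a standard SVM; the third ties predictions on unlabeled points to the labeled ones through the manifold structure, but the classifier f itself remains shallow.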
Deep architectures seem a natural choice for hard AI tasks that involve several sub-tasks which can be coded into the layers of the architecture. As argued by several researchers (Hinton et al., 2006; Bengio et al., 2007), semi-supervised learning is also natural in such a setting, as otherwise one is not likely to ever have enough labeled data to perform well.
Several authors have recently proposed methods for using unlabeled data in deep neural network-based architectures. These methods either perform a greedy layer-wise pre-training of weights using unlabeled data alone followed by supervised fine-tuning (which can be compared to the disjoint shallow techniques for semi-supervised learning described above), or learn unsupervised encodings at multiple levels of the architecture jointly with a supervised signal. Considering only the latter, the basic setup we advocate is simple:
1. Choose an unsupervised learning algorithm.
2. Choose a model with a deep architecture.
3. Plug the unsupervised learning into any (or all) layers of the architecture as an auxiliary task.
4. Simultaneously train the supervised and unsupervised tasks using the same architecture.
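A minimal sketch of this four-step recipe follows, written in PyTorch purely for illustration (the framework, pair construction, margin, and loss weighting are all assumptions on our part, not the paper's original implementation). The auxiliary task is a margin-based embedding loss on a hidden layer: presumed neighbors are pulled together, presumed non-neighbors are pushed at least a margin apart.

# Joint training of a supervised task and an unsupervised embedding
# auxiliary task on the same deep network (illustrative sketch).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_hid, n_classes, margin = 20, 50, 2, 1.0

# Step 2: a deep architecture; `hidden` is the layer into which the
# auxiliary embedding task is plugged (Step 3).
hidden = torch.nn.Sequential(torch.nn.Linear(d_in, d_hid), torch.nn.ReLU(),
                             torch.nn.Linear(d_hid, d_hid), torch.nn.ReLU())
output = torch.nn.Linear(d_hid, n_classes)
opt = torch.optim.SGD(list(hidden.parameters()) + list(output.parameters()), lr=0.01)

# Toy data: a few labeled points, plus unlabeled (neighbor, non-neighbor)
# pairs; real neighbors would come from, e.g., a k-NN graph.
x_lab = torch.randn(16, d_in)
y_lab = torch.randint(0, n_classes, (16,))
x_unlab = torch.randn(64, d_in)
x_nbr = x_unlab + 0.01 * torch.randn_like(x_unlab)   # assumed neighbors
x_far = x_unlab[torch.randperm(len(x_unlab))]        # assumed non-neighbors

for step in range(100):
    # Supervised task: cross-entropy on the labeled points.
    sup_loss = F.cross_entropy(output(hidden(x_lab)), y_lab)

    # Step 1: unsupervised embedding loss on the hidden layer, trained
    # simultaneously with the supervised signal (Step 4).
    z, z_nbr, z_far = hidden(x_unlab), hidden(x_nbr), hidden(x_far)
    pull = (z - z_nbr).pow(2).sum(1).mean()
    push = F.relu(margin - (z - z_far).pow(2).sum(1)).mean()

    opt.zero_grad()
    (sup_loss + 0.1 * (pull + push)).backward()
    opt.step()

The relative weighting between the two tasks (0.1 here) is an assumed hyperparameter; in practice it would be tuned on validation data.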
The aim is that the unsupervised method will improve accuracy on the task at hand. However, the unsupervised methods so far proposed for deep architectures are in our opinion somewhat complicated and