
Self-taught Learning: Transfer Learning from Unlabeled Data
Rajat Raina rajatr@cs.stanford.edu
Alexis Battle ajbattle@cs.stanford.edu
Honglak Lee hllee@cs.stanford.edu
Benjamin Packer bpacker@cs.stanford.edu
Andrew Y. Ng ang@cs.stanford.edu
Computer Science Department, Stanford University, CA 94305 USA
Abstract
We present a new machine learning framework called “self-taught learning” for using unlabeled data in supervised classification tasks. We do not assume that the unlabeled data follows the same class labels or generative distribution as the labeled data. Thus, we would like to use a large number of unlabeled images (or audio samples, or text documents) randomly downloaded from the Internet to improve performance on a given image (or audio, or text) classification task. Such unlabeled data is significantly easier to obtain than in typical semi-supervised or transfer learning settings, making self-taught learning widely applicable to many practical learning problems. We describe an approach to self-taught learning that uses sparse coding to construct higher-level features using the unlabeled data. These features form a succinct input representation and significantly improve classification performance. When using an SVM for classification, we further show how a Fisher kernel can be learned for this representation.
1. Introduction
Labeled data for machine learning is often very difficult and expensive to obtain, and thus the ability to use unlabeled data holds significant promise in terms of vastly expanding the applicability of learning methods. In this paper, we study a novel use of unlabeled data for improving performance on supervised learning tasks. To motivate our discussion, consider as a running example the computer vision task of classifying images of elephants and rhinos. For this task, it is difficult to obtain many labeled examples of elephants and rhinos; indeed, it is difficult even to obtain many unlabeled examples of elephants and rhinos. (In fact, we find it difficult to envision a process for collecting such unlabeled images that does not immediately also provide the class labels.)
This makes the classification task quite hard with existing algorithms for using labeled and unlabeled data, including most semi-supervised learning algorithms such as the one by Nigam et al. (2000). In this paper, we ask how unlabeled images from other object classes—which are much easier to obtain than images specifically of elephants and rhinos—can be used. For example, given unlimited access to unlabeled, randomly chosen images downloaded from the Internet (probably none of which contain elephants or rhinos), can we do better on the given supervised classification task?
Our approach is motivated by the observation that even many randomly downloaded images will contain basic visual patterns (such as edges) that are similar to those in images of elephants and rhinos. If, therefore, we can learn to recognize such patterns from the unlabeled data, these patterns can be used for the supervised learning task of interest, such as recognizing elephants and rhinos. Concretely, our approach learns a succinct, higher-level feature representation of the inputs using unlabeled data; this representation makes the classification task of interest easier.
Although we use computer vision as a running example, the problem that we pose to the machine learning community is more general. Formally, we consider solving a supervised learning task given labeled and unlabeled data, where the unlabeled data does not share the class labels or the generative distribution of the labeled data. For example, given unlimited access to natural sounds (audio), can we perform better speaker identification? Given unlimited access to news articles (text), can we perform better email foldering of “ICML reviewing” vs. “NIPS reviewing” emails?
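For concreteness, the setting can be summarized as follows (the notation below is ours, added for illustration):

  Given:  a labeled set T = {(x_l^(1), y^(1)), ..., (x_l^(m), y^(m))}, x_l^(i) ∈ R^n, drawn from a distribution D;
          an unlabeled set U = {x_u^(1), ..., x_u^(k)}, x_u^(j) ∈ R^n, not assumed to be drawn from D or to admit the task's labels.
  Goal:   a classifier h that attains low error on new examples drawn from D.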
Like semi-supervised learning (Nigam et al., 2000), our algorithms will therefore use labeled and unlabeled data. But unlike semi-supervised learning as it is typically studied in the literature, we do not assume that the unlabeled data can be assigned to the supervised learning task’s class labels. To thus distinguish our formalism from such forms of semi-supervised learning, we will call our task self-taught learning.
There is no prior general, principled framework for incorporating such unlabeled data into a supervised learning algorithm.