Efficient Learning of Sparse Representations
with an Energy-Based Model
Marc’Aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann LeCun
Courant Institute of Mathematical Sciences
New York University, New York, NY 10003
{ranzato,crispy,sumit,yann}@cs.nyu.edu
Abstract
We describe a novel unsupervised method for learning sparse, overcomplete features.
The model uses a linear encoder, and a linear decoder preceded by a sparsifying
non-linearity that turns a code vector into a quasi-binary sparse code vector. Given
an input, the optimal code minimizes the distance between the output of the decoder
and the input patch while being as similar as possible to the encoder output.
Learning proceeds in a two-phase EM-like fashion: (1) compute the minimum-energy
code vector, (2) adjust the parameters of the encoder and decoder so as to decrease
the energy. The model produces “stroke detectors” when trained on handwritten
numerals, and Gabor-like filters when trained on natural image patches. Inference
and learning are very fast, requiring no preprocessing and no expensive sampling.
Using the proposed unsupervised method to initialize the first layer of a
convolutional network, we achieved an error rate slightly lower than the best
reported result on the MNIST dataset. Finally, we describe an extension of the
method that learns topographical filter maps.
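To make the two-phase procedure concrete, the following is a minimal sketch in
Python/NumPy. It is our illustration, not the authors' code: it assumes
squared-error terms for the reconstruction and code-prediction energies,
substitutes a plain logistic squashing function for the paper's sparsifying
non-linearity, and runs both the code search and the parameter updates by simple
gradient descent; all names, sizes, and step sizes are illustrative.

import numpy as np

# Illustrative sketch of the two-phase, EM-like procedure (our assumptions:
# squared-error energies, a plain logistic as a stand-in for the sparsifying
# non-linearity, gradient descent for both phases).
rng = np.random.default_rng(0)
n_input, n_code = 64, 128                # overcomplete: code larger than input
W_e = 0.1 * rng.standard_normal((n_code, n_input))    # linear encoder
W_d = 0.1 * rng.standard_normal((n_input, n_code))    # linear decoder

def sparsify(z):
    # Stand-in for the sparsifying non-linearity (quasi-binary squashing).
    return 1.0 / (1.0 + np.exp(-z))

def energy(x, z):
    # Reconstruction energy plus code-prediction energy.
    recon = W_d @ sparsify(z)
    return np.sum((x - recon) ** 2) + np.sum((z - W_e @ x) ** 2)

def train_step(x, code_steps=20, lr_code=0.05, lr_param=0.005):
    global W_e, W_d
    # Phase 1: gradient descent on the code to approximate the
    # minimum-energy code vector, starting from the encoder output.
    z = W_e @ x
    for _ in range(code_steps):
        s = sparsify(z)
        recon_err = W_d @ s - x
        grad_z = 2 * (W_d.T @ recon_err) * s * (1 - s) + 2 * (z - W_e @ x)
        z = z - lr_code * grad_z
    # Phase 2: adjust encoder and decoder parameters to decrease the
    # energy of the code found in phase 1.
    s = sparsify(z)
    recon_err = W_d @ s - x
    W_d = W_d - lr_param * 2 * np.outer(recon_err, s)
    W_e = W_e - lr_param * 2 * np.outer(W_e @ x - z, x)
    return energy(x, z)

x = rng.standard_normal(n_input)         # random stand-in for an image patch
print(train_step(x))                     # energy after one update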
1 Introduction
Unsupervised learning methods are often used to produce pre-processors and feature
extractors for image analysis systems. Popular methods such as Wavelet
decomposition, PCA, Kernel-PCA, Non-Negative Matrix Factorization [1], and ICA
produce compact representations with somewhat uncorrelated (or independent)
components. Most methods produce representations that either preserve or reduce
the dimensionality of the input. However, several recent works have advocated the
use of sparse-overcomplete representations for images, in which the dimension of
the feature vector is larger than the dimension of the input, but only a small
number of components are non-zero for any one image [2, 3]. Sparse-overcomplete
representations present several potential advantages. Using high-dimensional
representations increases the likelihood that image categories will be easily
(possibly linearly) separable. Sparse representations can provide a simple
interpretation of the input data in terms of a small number of “parts” by
extracting the structure hidden in the data. Furthermore, there is considerable
evidence that biological vision uses sparse representations in early visual
areas [4, 5].
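As a toy numerical illustration of these terms (the numbers here are ours, purely
for exposition, not drawn from the cited work): an input of dimension four might be
represented by a code of dimension ten, of which only two components are active.

import numpy as np

# Toy illustration (our numbers): a sparse-overcomplete code has more
# components than the input, but only a few of them are non-zero.
x = np.array([0.8, -0.1, 0.3, 0.5])      # input, dimension 4
z = np.zeros(10)                         # code, dimension 10 > 4: overcomplete
z[[2, 7]] = [1.0, 0.6]                   # only 2 of 10 components are active
print(np.count_nonzero(z), "of", z.size, "components non-zero")  # 2 of 10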
It seems reasonable to consider a representation “complete” if it is possible to
reconstruct the input from it, because the information contained in the input would
need to be preserved in the representation itself. Most unsupervised learning
methods for feature extraction are based on this principle, and can be understood
in terms of an encoder module followed by a decoder module. The encoder takes the
input and computes a code vector, for example a sparse and overcomplete
representation. The decoder takes the code vector given by the encoder and produces
a reconstruction of the input. Encoder and decoder are trained in such a way that
reconstructions provided by the decoder are as similar as possible to the actual
input data, when these input data have the same statistics as the training samples.
Methods such as Vector Quantization, PCA, auto-encoders [6], Restricted Boltzmann
Machines [7], and others [8] have exactly this architecture, but with different
constraints on the code, different learning algorithms, and different kinds of
encoder and decoder architectures. In