Convolutional Deep Belief Networks
for Scalable Unsupervised Learning of Hierarchical Representations
Honglak Lee hllee@cs.stanford.edu
Roger Grosse rgrosse@cs.stanford.edu
Rajesh Ranganath rajeshr@cs.stanford.edu
Andrew Y. Ng ang@cs.stanford.edu
Computer Science Department, Stanford University, Stanford, CA 94305, USA
Abstract
There has been much interest in unsuper-
vised learning of hierarchical generative mod-
els such as deep belief networks. Scaling
such models to full-sized, high-dimensional
images remains a difficult problem. To ad-
dress this problem, we present the convolu-
tional deep belief network, a hierarchical gen-
erative model which scales to realistic image
sizes. This model is translation-invariant and
supports efficient bottom-up and top-down
probabilistic inference. Key to our approach
is probabilistic max-pooling, a novel technique
which shrinks the representations of higher
layers in a probabilistically sound way. Our
experiments show that the algorithm learns
useful high-level visual features, such as ob-
ject parts, from unlabeled images of objects
and natural scenes. We demonstrate excel-
lent performance on several visual recogni-
tion tasks and show that our model can per-
form hierarchical (bottom-up and top-down)
inference over full-sized images.
1. Introduction
The visual world can be described at many levels: pixel
intensities, edges, object parts, objects, and beyond.
The prospect of learning hierarchical models which
simultaneously represent multiple levels has recently
generated much interest. Ideally, such “deep” repre-
sentations would learn hierarchies of feature detectors,
and further be able to combine top-down and bottom-
up processing of an image. For instance, lower layers
could support object detection by spotting low-level
features indicative of object parts. Conversely, infor-
mation about objects in the higher layers could resolve
lower-level ambiguities in the image or infer the loca-
tions of hidden object parts.
Deep architectures consist of feature detector units ar-
ranged in layers. Lower layers detect simple features
and feed into higher layers, which in turn detect more
complex features. There have been several approaches
to learning deep networks (LeCun et al., 1989; Bengio
et al., 2006; Ranzato et al., 2006; Hinton et al., 2006).
In particular, the deep belief network (DBN) (Hinton
et al., 2006) is a multilayer generative model where
each layer encodes statistical dependencies among the
units in the layer below it; it is trained to (approxi-
mately) maximize the likelihood of its training data.
DBNs have been successfully used to learn high-level
structure in a wide variety of domains, including hand-
written digits (Hinton et al., 2006) and human motion
capture data (Taylor et al., 2007). We build upon the
DBN in this paper because we are interested in learn-
ing a generative model of images which can be trained
in a purely unsupervised manner.
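As a concrete illustration of the greedy layer-wise training that underlies a DBN, the following minimal sketch (not the authors' code; the function names and hyperparameters are ours, and it assumes binary units trained with one-step contrastive divergence) stacks restricted Boltzmann machines, feeding each layer the hidden activations of the layer below:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_rbm(data, n_hidden, lr=0.01, epochs=10, rng=np.random):
        # One binary RBM trained with CD-1; data has shape (n_samples, n_visible).
        n_visible = data.shape[1]
        W = 0.01 * rng.randn(n_visible, n_hidden)
        b, c = np.zeros(n_visible), np.zeros(n_hidden)   # visible / hidden biases
        for _ in range(epochs):
            for v0 in data:
                h0 = sigmoid(v0 @ W + c)                              # bottom-up inference
                h_sample = (h0 > rng.rand(n_hidden)).astype(float)    # sample hidden states
                v1 = sigmoid(h_sample @ W.T + b)                      # top-down reconstruction
                h1 = sigmoid(v1 @ W + c)
                W += lr * (np.outer(v0, h0) - np.outer(v1, h1))       # CD-1 update
                b += lr * (v0 - v1)
                c += lr * (h0 - h1)
        return W, b, c

    def train_dbn(data, layer_sizes):
        # Greedy layer-wise training: each new RBM models the layer below.
        layers, x = [], data
        for n_hidden in layer_sizes:
            W, b, c = train_rbm(x, n_hidden)
            layers.append((W, b, c))
            x = sigmoid(x @ W + c)   # activations become the next layer's "data"
        return layers

Each layer is thus trained to model the statistical dependencies among the units below it, which is what allows the stack as a whole to (approximately) maximize the likelihood of the training data.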
While DBNs have been successful in controlled do-
mains, scaling them to realistic-sized (e.g., 200x200
pixel) images remains challenging for two reasons.
First, images are high-dimensional, so the algorithms
must scale gracefully and be computationally tractable
even when applied to large images. Second, objects
can appear at arbitrary locations in images; thus it
is desirable that representations be invariant at least
to local translations of the input. We address these
issues by incorporating translation invariance. Like
LeCun et al. (1989) and Grosse et al. (2007), we
learn feature detectors which are shared among all lo-
cations in an image, because features which capture
useful information in one part of an image can pick up
the same information elsewhere. Thus, our model can
represent large images using only a small number of
feature detectors.
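The following sketch (illustrative only; the image and filter sizes are arbitrary choices of ours, not taken from the paper) shows why shared detectors keep the parameter count small: one small filter is convolved with the image at every location, so a handful of filters produces feature maps covering an image of any size:

    import numpy as np

    def valid_convolve(image, flt):
        # 2D "valid" convolution: flip the filter and slide it over the image.
        H, W = image.shape
        h, w = flt.shape
        flipped = flt[::-1, ::-1]
        out = np.zeros((H - h + 1, W - w + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i+h, j:j+w] * flipped)
        return out

    def hidden_activations(image, filters, biases):
        # One sigmoid activation map per shared filter.
        return [1.0 / (1.0 + np.exp(-(valid_convolve(image, f) + b)))
                for f, b in zip(filters, biases)]

    # e.g., a 200x200 image represented with only 24 shared 10x10 filters:
    image = np.random.rand(200, 200)
    filters = [0.01 * np.random.randn(10, 10) for _ in range(24)]
    maps = hidden_activations(image, filters, np.zeros(24))

The parameters grow with the number and size of the filters, not with the size of the image, which is what makes full-sized images tractable.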
This paper presents the convolutional deep belief net-
work, a hierarchical generative model that scales to
full-sized images. Another key to our approach is