the case of a probabilistic model, exact inference typically
becomes intractable. In the case of deep models, the
computational graph diverges from the structure of the
model. For example, in the case of a DBM, unrolling
variational (approximate) inference into a computational
graph results in a recurrent graph structure. We have
performed preliminary exploration [179] of deterministic
variants of deep autoencoders whose computational graph
is similar to that of a DBM (in fact very close to the mean-
field variational approximations associated with the
Boltzmann machine), and that is one interesting inter-
mediate point to explore (between the deterministic
approaches and the graphical model approaches).
In the next few sections, we will review the major
developments in single-layer training modules used to
support feature learning and particularly deep learning. We
divide these sections between (Section 6) the probabilistic
models, with inference and training schemes that directly
parameterize the generative—or decoding—pathway and
(Section 7) the typically neural network-based models that
directly parametrize the encoding pathway. Interestingly,
some models like predictive sparse decomposition (PSD)
[109] inherit both properties and will also be discussed
(Section 7.2.4). We then present a different view of
representation learning, based on the associated geometry
and the manifold assumption, in Section 8.
First, let us consider an unsupervised single-layer
representation learning algorithm spanning all three views:
probabilistic, autoencoder, and manifold learning.
5.1 PCA
We will use probably the oldest feature extraction algo-
rithm, PCA, to illustrate the probabilistic, autoencoder, and
manifold views of representation learning. PCA learns a
linear transformation $h = f(x) = W^T x + b$ of input $x \in \mathbb{R}^{d_x}$,
where the columns of the $d_x \times d_h$ matrix $W$ form an orthogonal
basis for the $d_h$ orthogonal directions of greatest variance in
the training data. The result is $d_h$ features (the components
of representation $h$) that are decorrelated. The three
interpretations of PCA are the following: 1) It is related to
probabilistic models (Section 6) such as probabilistic PCA,
factor analysis, and the traditional multivariate Gaussian
distribution (the leading eigenvectors of the covariance
matrix are the principal components); 2) the representation
it learns is essentially the same as that learned by a basic
linear autoencoder (Section 7.2); and 3) it can be viewed as a
simple linear form of manifold learning (Section 8), i.e.,
characterizing a lower dimensional region in input space
near which the data density is peaked. Thus, PCA may be in
the back of the reader’s mind as a common thread relating
these various viewpoints. Unfortunately, the expressive
power of linear features is very limited: They cannot be
stacked to form deeper, more abstract representations since
the composition of linear operations yields another linear
operation. Here, we focus on recent algorithms that have
been developed to extract nonlinear features, which can be
stacked in the construction of deep networks, although
some authors simply insert a nonlinearity between learned
single-layer linear projections [125], [43].
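The three views of PCA above can be made concrete in a few lines. The sketch below, with invented toy data and dimensions, recovers the principal directions as leading eigenvectors of the sample covariance (probabilistic view), computes decorrelated features $h = W^T(x - \mu)$ (autoencoder encoding view), and checks that a linear reconstruction from $h$ is accurate, indicating the data concentrate near a low-dimensional subspace (manifold view):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (invented for illustration): 500 points in R^3 lying near
# a 2D linear manifold, plus small isotropic noise.
z = rng.normal(size=(500, 2))
A = np.array([[1.0, 0.0], [0.5, 1.0], [0.2, 0.3]])
X = z @ A.T + 0.05 * rng.normal(size=(500, 3))

# Probabilistic view: principal directions are the leading
# eigenvectors of the sample covariance matrix.
mu = X.mean(axis=0)
cov = np.cov(X - mu, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
W = eigvecs[:, ::-1][:, :2]                 # keep the top d_h = 2 directions

# Encoder view: h = W^T (x - mu); the learned features are decorrelated.
H = (X - mu) @ W

# Decoder (linear autoencoder) view: reconstruct x linearly from h.
# Small error shows the data concentrate near the learned 2D subspace.
X_hat = H @ W.T + mu
print("off-diagonal feature covariance:", np.cov(H, rowvar=False)[0, 1])
print("mean squared reconstruction error:", np.mean((X - X_hat) ** 2))
```

The same subspace could equally be found by gradient descent on the reconstruction error of a linear autoencoder, which is what makes PCA a useful bridge between the probabilistic and neural-network perspectives.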
Another rich family of feature extraction techniques that
this review does not cover in any detail due to space
constraints is independent component analysis or ICA
[108], [8]. Instead, we refer the reader to [101], [103]. Note
that while in the simplest case (complete, noise free) ICA
yields linear features, in the more general case it can be
equated with a linear generative model with non-Gaussian
independent latent variables, similar to sparse coding
(Section 6.1.1), which results in nonlinear features. Therefore,
ICA and its variants like independent and topographic ICA
[102] can and have been used to build deep networks [122],
[125] (see Section 11.2). The notion of obtaining indepen-
dent components also appears similar to our stated goal of
disentangling underlying explanatory factors through deep
networks. However, for complex real-world distributions, it
is doubtful that the relationship between truly independent
underlying factors and the observed high-dimensional data
can be adequately characterized by a linear transformation.
6 PROBABILISTIC MODELS
From the probabilistic modeling perspective, the question
of feature learning can be interpreted as an attempt to
recover a parsimonious set of latent random variables that
describe a distribution over the observed data. We can
express as $p(x, h)$ a probabilistic model over the joint space
of the latent variables, h, and observed data or visible
variables x. Feature values are conceived as the result of an
inference process to determine the probability distribution
of the latent variables given the data, i.e., $p(h \mid x)$, often
referred to as the posterior probability. Learning is conceived
in terms of estimating a set of model parameters that
(locally) maximize the regularized likelihood of the training
data. The probabilistic graphical model formalism gives us
two possible modeling paradigms in which we can consider
the question of inferring latent variables, directed and
undirected graphical models, which differ in how they
parameterize the joint distribution $p(x, h)$, with major
consequences for the nature and computational costs of both
inference and learning.
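Both ingredients of this paradigm, posterior inference of $p(h \mid x)$ and likelihood maximization, appear together in expectation-maximization for a toy two-component Gaussian mixture: the E-step infers the posterior over the discrete latent variable, and the M-step updates parameters to increase the (here unregularized) likelihood. A minimal sketch, with invented data and initialization:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic 1D data from a two-component mixture (parameters invented).
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(2.0, 0.5, 700)])

# Model: latent h in {0, 1} selects a Gaussian component.
pi, mu, var = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def gauss(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for _ in range(100):
    # E-step: inference, i.e., the posterior p(h = 1 | x) for each point.
    p1 = pi * gauss(x, mu[1], var[1])
    p0 = (1 - pi) * gauss(x, mu[0], var[0])
    r = p1 / (p0 + p1)
    # M-step: parameter updates that increase the data likelihood.
    pi = r.mean()
    mu = np.array([np.sum((1 - r) * x) / np.sum(1 - r),
                   np.sum(r * x) / np.sum(r)])
    var = np.array([np.sum((1 - r) * (x - mu[0]) ** 2) / np.sum(1 - r),
                    np.sum(r * (x - mu[1]) ** 2) / np.sum(r)])

print("mixing weight:", pi)
print("component means:", mu)
```

Here the posterior is available in closed form; the difficulty discussed in the rest of this section is precisely that, for richer latent-variable models, this E-step becomes intractable.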
6.1 Directed Graphical Models
Directed latent factor models separately parameterize the
conditional likelihood $p(x \mid h)$ and the prior $p(h)$ to construct
the joint distribution $p(x, h) = p(x \mid h)\,p(h)$. Examples of
this decomposition include: PCA [171], [206], sparse coding
[155], sigmoid belief networks [152], and the newly
introduced spike-and-slab sparse coding (S3C) model [72].
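This factorization makes sampling straightforward: draw the latent cause $h$ from the prior, then the observation $x$ from the conditional. The sketch below uses a linear-Gaussian decoder (essentially the probabilistic PCA generative model); the dimensions and noise level are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes, chosen only for illustration.
d_h, d_x, n = 4, 10, 20000
W = rng.normal(size=(d_x, d_h))   # generative ("decoder") weights
sigma = 0.1                       # observation noise standard deviation

# Ancestral sampling follows the factorization p(x, h) = p(x | h) p(h):
# first draw the latent cause h, then the observation x given h.
h = rng.normal(size=(n, d_h))                     # prior p(h) = N(0, I)
x = h @ W.T + sigma * rng.normal(size=(n, d_x))   # p(x | h) = N(W h, sigma^2 I)

# The induced marginal p(x) is Gaussian with covariance W W^T + sigma^2 I;
# the empirical covariance of the samples should match it closely.
emp_cov = np.cov(x, rowvar=False)
model_cov = W @ W.T + sigma**2 * np.eye(d_x)
print("max |empirical - model| covariance entry:",
      np.abs(emp_cov - model_cov).max())
```

Sampling in the generative direction is easy for all the models listed above; it is the reverse direction, inferring $h$ from $x$, that raises the difficulty discussed next.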
6.1.1 Explaining Away
Directed models often lead to one important property:
explaining away, i.e., a priori independent causes of an event
can become nonindependent given the observation of the
event. Latent factor models can generally be interpreted as
latent cause models, where the h activations cause the
observed x. This renders the a priori independent h to be
nonindependent. As a consequence, recovering the posterior
distribution of $h$, $p(h \mid x)$ (which we use as a basis for feature
representation), is often computationally challenging and
can be entirely intractable, especially when h is discrete.
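The effect can be checked by direct enumeration in a tiny two-cause model (all probabilities below are invented, and `p_alarm` is a hypothetical noisy-OR-style likelihood): observing the event makes the causes compete to explain it, so confirming one cause lowers the posterior on the other.

```python
# Two a priori independent binary causes h1 (burglary) and h2 (earthquake)
# of a single observed event (the alarm). All numbers are invented.
p_h1 = 0.01
p_h2 = 0.01

def p_alarm(h1, h2):
    # Hypothetical likelihood: either cause almost surely trips the alarm;
    # false alarms are rare.
    return 0.95 if (h1 or h2) else 0.001

def posterior_h1(given_h2=None):
    """P(h1 = 1 | alarm = 1[, h2 = given_h2]) by enumerating the joint."""
    num = den = 0.0
    for h1 in (0, 1):
        for h2 in (0, 1):
            if given_h2 is not None and h2 != given_h2:
                continue
            joint = ((p_h1 if h1 else 1 - p_h1)
                     * (p_h2 if h2 else 1 - p_h2)
                     * p_alarm(h1, h2))
            den += joint
            num += joint * h1
    return num / den

print("P(burglary | alarm)             =", posterior_h1())
print("P(burglary | alarm, earthquake) =", posterior_h1(given_h2=1))
# Observing the earthquake "explains away" the alarm: belief in a
# burglary drops back toward its small prior.
```

Exact enumeration is feasible here only because there are two binary causes; with many latent variables the sum over configurations grows exponentially, which is the intractability referred to above.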
A classic example that illustrates the phenomenon is to
imagine you are on vacation away from home and you
receive a phone call from the security system company
1804 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 35, NO. 8, AUGUST 2013