the case of a probabilistic model, exact inference typically
becomes intractable. In the case of deep models, the
computational graph diverges from the structure of the
model. For example, in the case of a DBM, unrolling
variational (approximate) inference into a computational
graph results in a recurrent graph structure. We have
performed preliminary exploration [179] of deterministic
variants of deep autoencoders whose computational graph
is similar to that of a DBM (in fact very close to the mean-
field variational approximations associated with the
Boltzmann machine), and that is one interesting inter-
mediate point to explore (between the deterministic
approaches and the graphical model approaches).
In the next few sections, we will review the major
developments in single-layer training modules used to
support feature learning and particularly deep learning. We
divide these sections between (Section 6) the probabilistic
models, with inference and training schemes that directly
parameterize the generative—or decoding—pathway and
(Section 7) the typically neural network-based models that
directly parametrize the encoding pathway. Interestingly,
some models like predictive sparse decomposition (PSD)
[109] inherit both properties and will also be discussed
(Section 7.2.4). We then present a different view of
representation learning, based on the associated geometry
and the manifold assumption, in Section 8.
First, let us consider an unsupervised single-layer
representation learning algorithm spanning all three views:
probabilistic, autoencoder, and manifold learning.
5.1 PCA
We will use probably the oldest feature extraction algo-
rithm, PCA, to illustrate the probabilistic, autoencoder, and
manifold views of representation learning. PCA learns a
linear transformation $h = f(x) = W^T x + b$ of input $x \in \mathbb{R}^{d_x}$,
where the columns of the $d_x \times d_h$ matrix $W$ form an orthogonal
basis for the $d_h$ orthogonal directions of greatest variance in
the training data. The result is $d_h$ features (the components
of representation $h$) that are decorrelated. The three
interpretations of PCA are the following: 1) It is related to
probabilistic models (Section 6) such as probabilistic PCA,
factor analysis, and the traditional multivariate Gaussian
distribution (the leading eigenvectors of the covariance
matrix are the principal components); 2) the representation
it learns is essentially the same as that learned by a basic
linear autoencoder (Section 7.2); and 3) it can be viewed as a
simple linear form of manifold learning (Section 8), i.e.,
characterizing a lower dimensional region in input space
near which the data density is peaked. Thus, PCA may be in
the back of the reader’s mind as a common thread relating
these various viewpoints. Unfortunately, the expressive
power of linear features is very limited: They cannot be
stacked to form deeper, more abstract representations since
the composition of linear operations yields another linear
operation. Here, we focus on recent algorithms that have
been developed to extract nonlinear features, which can be
stacked in the construction of deep networks, although
some authors simply insert a nonlinearity between learned
single-layer linear projections [125], [43].
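The three views of PCA above can be made concrete in a few lines. The sketch below, with invented toy data and dimensions, recovers the principal directions as leading eigenvectors of the sample covariance (probabilistic view), computes decorrelated features $h = W^T(x - \mu)$ (autoencoder encoding view), and checks that a linear reconstruction from $h$ is accurate, indicating the data concentrate near a low-dimensional subspace (manifold view):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (invented for illustration): 500 points in R^3 lying near
# a 2D linear manifold, plus small isotropic noise.
z = rng.normal(size=(500, 2))
A = np.array([[1.0, 0.0], [0.5, 1.0], [0.2, 0.3]])
X = z @ A.T + 0.05 * rng.normal(size=(500, 3))

# Probabilistic view: principal directions are the leading
# eigenvectors of the sample covariance matrix.
mu = X.mean(axis=0)
cov = np.cov(X - mu, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
W = eigvecs[:, ::-1][:, :2]                 # keep the top d_h = 2 directions

# Encoder view: h = W^T (x - mu); the learned features are decorrelated.
H = (X - mu) @ W

# Decoder (linear autoencoder) view: reconstruct x linearly from h.
# Small error shows the data concentrate near the learned 2D subspace.
X_hat = H @ W.T + mu
print("off-diagonal feature covariance:", np.cov(H, rowvar=False)[0, 1])
print("mean squared reconstruction error:", np.mean((X - X_hat) ** 2))
```

The same subspace could equally be found by gradient descent on the reconstruction error of a linear autoencoder, which is what makes PCA a useful bridge between the probabilistic and neural-network perspectives.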
Another rich family of feature extraction techniques that
this review does not cover in any detail due to space
constraints is independent component analysis or ICA
[108], [8]. Instead, we refer the reader to [101], [103]. Note
that while in the simplest case (complete, noise free) ICA
yields linear features, in the more general case it can be
equated with a linear generative model with non-Gaussian
independent latent variables, similar to sparse coding
(Section 6.1.1), which results in nonlinear features. Therefore,
ICA and its variants like independent and topographic ICA
[102] can and have been used to build deep networks [122],
[125] (see Section 11.2). The notion of obtaining indepen-
dent components also appears similar to our stated goal of
disentangling underlying explanatory factors through deep
networks. However, for complex real-world distributions, it
is doubtful that the relationship between truly independent
underlying factors and the observed high-dimensional data
can be adequately characterized by a linear transformation.
6 PROBABILISTIC MODELS
From the probabilistic modeling perspective, the question
of feature learning can be interpreted as an attempt to
recover a parsimonious set of latent random variables that
describe a distribution over the observed data. We can
express as $p(x, h)$ a probabilistic model over the joint space
of the latent variables, h, and observed data or visible
variables x. Feature values are conceived as the result of an
inference process to determine the probability distribution
of the latent variables given the data, i.e., $p(h \mid x)$, often
referred to as the posterior probability. Learning is conceived
in terms of estimating a set of model parameters that
(locally) maximize the regularized likelihood of the training
data. The probabilistic graphical model formalism gives us
two possible modeling paradigms in which we can consider
the question of inferring latent variables, directed and
undirected graphical models, which differ in how they
parameterize the joint distribution $p(x, h)$, with major
consequences for the nature and computational costs of both
inference and learning.
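Both ingredients of this paradigm, posterior inference of $p(h \mid x)$ and likelihood maximization, appear together in expectation-maximization for a toy two-component Gaussian mixture: the E-step infers the posterior over the discrete latent variable, and the M-step updates parameters to increase the (here unregularized) likelihood. A minimal sketch, with invented data and initialization:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic 1D data from a two-component mixture (parameters invented).
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(2.0, 0.5, 700)])

# Model: latent h in {0, 1} selects a Gaussian component.
pi, mu, var = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def gauss(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for _ in range(100):
    # E-step: inference, i.e., the posterior p(h = 1 | x) for each point.
    p1 = pi * gauss(x, mu[1], var[1])
    p0 = (1 - pi) * gauss(x, mu[0], var[0])
    r = p1 / (p0 + p1)
    # M-step: parameter updates that increase the data likelihood.
    pi = r.mean()
    mu = np.array([np.sum((1 - r) * x) / np.sum(1 - r),
                   np.sum(r * x) / np.sum(r)])
    var = np.array([np.sum((1 - r) * (x - mu[0]) ** 2) / np.sum(1 - r),
                    np.sum(r * (x - mu[1]) ** 2) / np.sum(r)])

print("mixing weight:", pi)
print("component means:", mu)
```

Here the posterior is available in closed form; the difficulty discussed in the rest of this section is precisely that, for richer latent-variable models, this E-step becomes intractable.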
6.1 Directed Graphical Models
Directed latent factor models separately parameterize the
conditional likelihood $p(x \mid h)$ and the prior $p(h)$ to construct
the joint distribution $p(x, h) = p(x \mid h)\,p(h)$. Examples of
this decomposition include: PCA [171], [206], sparse coding
[155], sigmoid belief networks [152], and the newly
introduced spike-and-slab sparse coding (S3C) model [72].
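This factorization makes sampling straightforward: draw the latent cause $h$ from the prior, then the observation $x$ from the conditional. The sketch below uses a linear-Gaussian decoder (essentially the probabilistic PCA generative model); the dimensions and noise level are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes, chosen only for illustration.
d_h, d_x, n = 4, 10, 20000
W = rng.normal(size=(d_x, d_h))   # generative ("decoder") weights
sigma = 0.1                       # observation noise standard deviation

# Ancestral sampling follows the factorization p(x, h) = p(x | h) p(h):
# first draw the latent cause h, then the observation x given h.
h = rng.normal(size=(n, d_h))                     # prior p(h) = N(0, I)
x = h @ W.T + sigma * rng.normal(size=(n, d_x))   # p(x | h) = N(W h, sigma^2 I)

# The induced marginal p(x) is Gaussian with covariance W W^T + sigma^2 I;
# the empirical covariance of the samples should match it closely.
emp_cov = np.cov(x, rowvar=False)
model_cov = W @ W.T + sigma**2 * np.eye(d_x)
print("max |empirical - model| covariance entry:",
      np.abs(emp_cov - model_cov).max())
```

Sampling in the generative direction is easy for all the models listed above; it is the reverse direction, inferring $h$ from $x$, that raises the difficulty discussed next.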
6.1.1 Explaining Away
Directed models often lead to one important property:
explaining away, i.e., a priori independent causes of an event
can become nonindependent given the observation of the
event. Latent factor models can generally be interpreted as
latent cause models, where the h activations cause the
observed x. This renders the a priori independent h to be
nonindependent. As a consequence, recovering the posterior
distribution of $h$, $p(h \mid x)$ (which we use as a basis for feature
representation), is often computationally challenging and
can be entirely intractable, especially when h is discrete.
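The effect can be checked by direct enumeration in a tiny two-cause model (all probabilities below are invented, and `p_alarm` is a hypothetical noisy-OR-style likelihood): observing the event makes the causes compete to explain it, so confirming one cause lowers the posterior on the other.

```python
# Two a priori independent binary causes h1 (burglary) and h2 (earthquake)
# of a single observed event (the alarm). All numbers are invented.
p_h1 = 0.01
p_h2 = 0.01

def p_alarm(h1, h2):
    # Hypothetical likelihood: either cause almost surely trips the alarm;
    # false alarms are rare.
    return 0.95 if (h1 or h2) else 0.001

def posterior_h1(given_h2=None):
    """P(h1 = 1 | alarm = 1[, h2 = given_h2]) by enumerating the joint."""
    num = den = 0.0
    for h1 in (0, 1):
        for h2 in (0, 1):
            if given_h2 is not None and h2 != given_h2:
                continue
            joint = ((p_h1 if h1 else 1 - p_h1)
                     * (p_h2 if h2 else 1 - p_h2)
                     * p_alarm(h1, h2))
            den += joint
            num += joint * h1
    return num / den

print("P(burglary | alarm)             =", posterior_h1())
print("P(burglary | alarm, earthquake) =", posterior_h1(given_h2=1))
# Observing the earthquake "explains away" the alarm: belief in a
# burglary drops back toward its small prior.
```

Exact enumeration is feasible here only because there are two binary causes; with many latent variables the sum over configurations grows exponentially, which is the intractability referred to above.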
A classic example that illustrates the phenomenon is to
imagine you are on vacation away from home and you
receive a phone call from the security system company
1804 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 35, NO. 8, AUGUST 2013