WHY DOES UNSUPERVISED PRE-TRAINING HELP DEEP LEARNING?
and Niyogi, 2002; Chapelle et al., 2003). A long-standing variant of this approach is the applica-
tion of Principal Components Analysis as a pre-processing step before applying a classifier (on the
projected data). In these models the data is first transformed into a new representation using
unsupervised learning, and a supervised classifier is stacked on top, learning to map this new
representation to class predictions.
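As a concrete illustration of this two-stage pipeline, the following minimal sketch uses synthetic two-class data, PCA computed via the SVD, and a nearest-centroid classifier standing in for the supervised learner; all data and design choices here are illustrative, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic classes in 10 dimensions, separated by a mean shift
X0 = rng.normal(loc=0.0, size=(50, 10))
X1 = rng.normal(loc=2.0, size=(50, 10))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

# Unsupervised step: PCA via SVD of the centered data
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
Z = (X - mu) @ Vt[:2].T          # project onto the top-2 principal components

# Supervised step: a nearest-centroid classifier on the projected data
centroids = np.stack([Z[y == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((Z[:, None, :] - centroids) ** 2).sum(-1), axis=1)
accuracy = (pred == y).mean()
```

Because the between-class direction carries most of the variance here, the top principal components preserve the class structure and the simple classifier separates the projected data almost perfectly.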
Instead of having separate unsupervised and supervised components in the model, one can con-
sider models in which P(X) (or P(X,Y)) and P(Y|X) share parameters (or whose parameters are
connected in some way), and one can trade off the supervised criterion −logP(Y|X) with the
unsupervised or generative one (−logP(X) or −logP(X,Y)). It can then be seen that the generative
criterion corresponds to a particular form of prior (Lasserre et al., 2006), namely that the structure of
P(X) is connected to the structure of P(Y|X) in a way that is captured by the shared parametrization.
By controlling how much of the generative criterion is included in the total criterion, one can find a
better trade-off than with a purely generative or a purely discriminative training criterion (Lasserre
et al., 2006; Larochelle and Bengio, 2008).
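The trade-off described above can be written as a single hybrid training criterion. In the following sketch, λ ≥ 0 is a weighting hyperparameter (our notation, not the paper's) and θ denotes the shared parameters:

    C(θ) = −log P(Y|X; θ) − λ log P(X; θ)

Setting λ = 0 recovers purely discriminative training, while letting the generative term dominate approaches purely generative training; intermediate values correspond to the form of prior discussed by Lasserre et al. (2006).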
In the context of deep architectures, a very interesting application of these ideas involves adding
an unsupervised embedding criterion at each layer (or only one intermediate layer) to a traditional
supervised criterion (Weston et al., 2008). This has been shown to be a powerful semi-supervised
learning strategy, and is an alternative to the kind of algorithms described and evaluated in this
paper, which also combine unsupervised learning with supervised learning.
In the context of scarcity of labelled data (and abundance of unlabelled data), deep architectures
have shown promise as well. Salakhutdinov and Hinton (2008) describe a method for learning the
covariance matrix of a Gaussian Process, in which the use of unlabelled examples for modeling
P(X) improves P(Y|X) quite significantly. Note that such a result is to be expected: with few
labelled samples, modeling P(X) usually helps. Our results show that even in the context of abundant
labelled data, unsupervised pre-training still has a pronounced positive effect on generalization: a
somewhat surprising conclusion.
4.2 Early Stopping as a Form of Regularization
We stated that pre-training as initialization can be seen as restricting the optimization procedure to
a relatively small volume of parameter space that corresponds to a local basin of attraction of the
supervised cost function. Early stopping can be seen as having a similar effect, by constraining the
optimization procedure to a region of the parameter space that is close to the initial configuration
of parameters. With τ the number of training iterations and η the learning rate used in the update
procedure, the product τη can be seen as the reciprocal of a regularization parameter: restricting
either quantity restricts the region of parameter space reachable from the starting point. In the case
of the optimization of a simple linear model (initialized at the origin) using a quadratic error function
and simple gradient descent, early stopping has an effect similar to traditional ℓ2 regularization.
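The linear-model case can be checked numerically. The sketch below (synthetic data; step counts, learning rate, and the regression setup are our illustrative choices) runs gradient descent on a quadratic error from the origin and compares the early-stopped weights to the ridge solution with regularization strength 1/(τη):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def early_stopped_gd(tau, eta):
    """tau steps of gradient descent on (1/2n)||Xw - y||^2, starting at the origin."""
    w = np.zeros(d)
    for _ in range(tau):
        w -= eta * X.T @ (X @ w - y) / n
    return w

eta = 0.01
w_short = early_stopped_gd(10, eta)       # stopped early
w_long = early_stopped_gd(50000, eta)     # run (almost) to convergence
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge solution with regularization strength 1/(tau * eta)
lam = 1.0 / (10 * eta)
w_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
```

Stopping early keeps the weights closer to the origin than running to convergence, and the early-stopped iterate lies closer to the ridge solution with λ = 1/(τη) than to the unregularized least-squares solution, in line with the correspondence described above.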
Thus, in both pre-training and early stopping, the parameters of the supervised cost function
are constrained to be close to their initial values.³ A more formal treatment of early stopping as
regularization is given by Sjöberg and Ljung (1995) and Amari et al. (1997). There is no equivalent
treatment of pre-training, but this paper sheds some light on the effects of such initialization in the
case of deep architectures.
3. In the case of pre-training, the “initial values” of the parameters for the supervised phase are those that were
obtained at the end of pre-training.