The layer-to-layer conditionals associated with the RBM factorize like in (1) and give rise to
$P(v_k = 1 | h) = \mathrm{sigm}(b_k + \sum_j W_{jk} h_j)$ and $Q(h_j = 1 | v) = \mathrm{sigm}(c_j + \sum_k W_{jk} v_k)$.
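These two factorized conditionals are all that is needed for block Gibbs sampling. As an illustration only (the names are ours, not the paper's), a minimal NumPy sketch of the two expressions above, assuming a weight matrix W of shape (hidden, visible) with entry W[j, k] = $W_{jk}$, a visible bias vector b and a hidden bias vector c:

    import numpy as np

    def sigm(x):
        # Logistic sigmoid: sigm(x) = 1 / (1 + exp(-x))
        return 1.0 / (1.0 + np.exp(-x))

    def p_v_given_h(h, W, b):
        # P(v_k = 1 | h) = sigm(b_k + sum_j W_jk h_j), computed for all k at once
        return sigm(b + W.T @ h)

    def q_h_given_v(v, W, c):
        # Q(h_j = 1 | v) = sigm(c_j + sum_k W_jk v_k), computed for all j at once
        return sigm(c + W @ v)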
2.2 Gibbs Markov chain and log-likelihood gradient in an RBM
To obtain an estimator of the gradient of the log-likelihood of an RBM, we consider a Gibbs Markov
chain on the (visible units, hidden units) pair of variables. Gibbs sampling from an RBM proceeds by
sampling $h$ given $v$, then $v$ given $h$, etc. Denote by $v_t$ the $t$-th $v$ sample from that chain, starting at
$t = 0$ with $v_0$, the “input observation” for the RBM. Therefore, $(v_k, h_k)$ for $k \to \infty$ is a sample from
the joint $P(v, h)$. The log-likelihood of a value $v_0$ under the model of the RBM is
\[
\log P(v_0) \;=\; \log \sum_h P(v_0, h)
\;=\; \log \sum_h e^{-\mathrm{energy}(v_0, h)} \;-\; \log \sum_{v, h} e^{-\mathrm{energy}(v, h)}
\]
and its gradient with respect to $\theta = (W, b, c)$ is
\[
\frac{\partial \log P(v_0)}{\partial \theta}
\;=\; -\sum_{h_0} Q(h_0 | v_0)\, \frac{\partial\, \mathrm{energy}(v_0, h_0)}{\partial \theta}
\;+\; \sum_{v_k, h_k} P(v_k, h_k)\, \frac{\partial\, \mathrm{energy}(v_k, h_k)}{\partial \theta}
\]
for $k \to \infty$. An unbiased sample is
\[
-\frac{\partial\, \mathrm{energy}(v_0, h_0)}{\partial \theta}
\;+\; E_{h_k}\!\left[ \frac{\partial\, \mathrm{energy}(v_k, h_k)}{\partial \theta} \,\Big|\, v_k \right],
\]
where $h_0$ is a sample from $Q(h_0 | v_0)$ and $(v_k, h_k)$ is a sample of the Markov chain, and the
expectation can be easily computed thanks to $P(h_k | v_k)$ factorizing.
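For instance, for a weight $W_{ji}$ (with $j$ indexing hidden and $i$ visible units) and assuming the binomial-unit energy $\mathrm{energy}(v, h) = -\sum_{j,i} W_{ji} h_j v_i - \sum_i b_i v_i - \sum_j c_j h_j$, which is consistent with the conditionals given above (this form is our restatement, not a quotation of the paper), we have $\partial\, \mathrm{energy}(v_k, h_k)/\partial W_{ji} = -h_{k,j}\, v_{k,i}$, writing $v_{k,i}$ and $h_{k,j}$ for the components of the samples $v_k$ and $h_k$. The conditional expectation then reduces to
\[
E_{h_k}\!\left[ \frac{\partial\, \mathrm{energy}(v_k, h_k)}{\partial W_{ji}} \,\Big|\, v_k \right]
= -\, Q(h_j = 1 | v_k)\; v_{k,i},
\]
so no sampling of $h_k$ is required for that term.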
The idea of the Contrastive Divergence
algorithm (Hinton, 2002) is to take $k$ small (typically $k = 1$). A pseudo-code for Contrastive Di-
vergence training (with $k = 1$) of an RBM with binomial input and hidden units is presented in the
Appendix (Algorithm RBMupdate($x, \epsilon, W, b, c$)). This procedure is called repeatedly with $v_0 = x$
sampled from the training distribution for the RBM. To decide when to stop one may use a proxy for
the training criterion, such as the reconstruction error $-\log P(v_1 = x | v_0 = x)$.
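Since the Appendix pseudo-code is not reproduced in this section, here is a minimal NumPy sketch of one CD-1 update for binomial units, for illustration only: the function and variable names (rbm_update_cd1, eps, q0, v1, ...) are ours, not the paper's, and the Appendix algorithm remains the reference.

    import numpy as np
    rng = np.random.default_rng(0)

    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))

    def rbm_update_cd1(x, eps, W, b, c):
        # Positive phase: Q(h | v0 = x), and a binary sample h0 from it.
        q0 = sigm(c + W @ x)                    # Q(h_j = 1 | v0)
        h0 = (rng.random(q0.shape) < q0) * 1.0
        # Negative phase: one Gibbs step, v1 ~ P(v | h0), then Q(h | v1).
        p1 = sigm(b + W.T @ h0)                 # P(v_k = 1 | h0)
        v1 = (rng.random(p1.shape) < p1) * 1.0
        q1 = sigm(c + W @ v1)                   # Q(h_j = 1 | v1)
        # Stochastic gradient step on log P(v0): positive minus negative statistics.
        W += eps * (np.outer(q0, x) - np.outer(q1, v1))
        b += eps * (x - v1)
        c += eps * (q0 - q1)
        return W, b, c

Whether binary samples or the corresponding probabilities are used in the negative-phase statistics is an implementation choice; the sketch uses probabilities for the hidden statistics and a sampled $v_1$.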
2.3 Greedy layer-wise training of a DBN
A greedy layer-wise training algorithm was proposed (Hinton et al., 2006) to train a DBN one layer at
a time. One first trains an RBM that takes the empirical data as input and models it. Denote $Q(g^1 | g^0)$
the posterior over $g^1$ associated with that trained RBM (we recall that $g^0 = x$ with $x$ the observed
input). This gives rise to an “empirical” distribution $\widehat{p}^1$ over the first layer $g^1$, when $g^0$ is sampled
from the data empirical distribution $\widehat{p}$: we have
\[
\widehat{p}^1(g^1) = \sum_{g^0} \widehat{p}(g^0)\, Q(g^1 | g^0).
\]
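Concretely, $\widehat{p}^1$ is never tabulated; one only needs samples of $g^1$, obtained by sampling the hidden units of the trained first RBM given each training example. A minimal sketch of that step, assuming binary training data X stored one example per row and the first RBM's parameters (W1, c1) in the shapes used in the earlier snippets (names are ours):

    import numpy as np
    rng = np.random.default_rng(0)

    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sample_g1(X, W1, c1):
        # For each training example x (row of X), draw g^1 ~ Q(g^1 | g^0 = x)
        # in the first trained RBM; these samples follow the distribution p-hat^1.
        Q1 = sigm(c1 + X @ W1.T)                # Q(g^1_j = 1 | g^0), row-wise
        return (rng.random(Q1.shape) < Q1) * 1.0

The second RBM is then trained on these samples of $g^1$ exactly as the first one was trained on the data.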
Note that a 1-level DBN is an RBM. The basic idea of the greedy layer-wise strategy is that after
training the top-level RBM of an $\ell$-level DBN, one changes the interpretation of the RBM parameters
to insert them in an $(\ell + 1)$-level DBN: the distribution $P(g^{\ell-1} | g^\ell)$ from the RBM associated with
layers $\ell - 1$ and $\ell$ is kept as part of the DBN generative model. In the RBM between layers $\ell - 1$
and $\ell$, $P(g^\ell)$ is defined in terms of the parameters of that RBM, whereas in the DBN $P(g^\ell)$ is defined
in terms of the parameters of the upper layers. Consequently, $Q(g^\ell | g^{\ell-1})$ of the RBM does not
correspond to $P(g^\ell | g^{\ell-1})$ in the DBN, except when that RBM is the top layer of the DBN. However,
we use $Q(g^\ell | g^{\ell-1})$ of the RBM as an approximation of the posterior $P(g^\ell | g^{\ell-1})$ for the DBN.
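The reason the two differ below the top level can be seen from Bayes' rule: in the DBN,
\[
P(g^\ell | g^{\ell-1}) \;\propto\; P(g^{\ell-1} | g^\ell)\, P(g^\ell),
\]
and while $P(g^{\ell-1} | g^\ell)$ is shared with the RBM, the prior $P(g^\ell)$ is given by the upper layers of the DBN rather than by the RBM's own marginal, so the posterior changes.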
The samples of $g^{\ell-1}$, with empirical distribution $\widehat{p}^{\ell-1}$, are converted stochastically into samples of $g^\ell$
with distribution $\widehat{p}^\ell$ through $\widehat{p}^\ell(g^\ell) = \sum_{g^{\ell-1}} \widehat{p}^{\ell-1}(g^{\ell-1})\, Q(g^\ell | g^{\ell-1})$. Although $\widehat{p}^\ell$ cannot be rep-
resented explicitly it is easy to sample unbiasedly from it: pick a training example and propagate it
stochastically through the $Q(g^i | g^{i-1})$ at each level. As a nice side benefit, one obtains an approxi-
mation of the posterior for all the hidden variables in the DBN, at all levels, given an input $g^0 = x$.
Mean-field propagation (see below) gives a fast deterministic approximation of posteriors $P(g^\ell | x)$.
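Both options (stochastic propagation, to sample from $\widehat{p}^\ell$, and mean-field propagation, to approximate the posteriors deterministically) can be sketched with the same upward pass, under the same assumptions and naming conventions as the earlier snippets, with one (W, c) pair per trained RBM listed bottom-up:

    import numpy as np
    rng = np.random.default_rng(0)

    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))

    def propagate_up(x, layers, stochastic=True):
        # layers: list of (W, c) pairs, one per RBM, from bottom to top.
        # Returns [g^1, ..., g^top]: binary samples of each Q(g^i | g^{i-1}) if
        # stochastic, otherwise the mean-field probabilities Q(g^i_j = 1 | g^{i-1}).
        g = x
        outputs = []
        for W, c in layers:
            q = sigm(c + W @ g)                 # Q(g^i_j = 1 | g^{i-1})
            g = (rng.random(q.shape) < q) * 1.0 if stochastic else q
            outputs.append(g)
        return outputs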
Note that if we consider all the layers of a DBN from level $i$ to the top, we have a smaller DBN,
which generates the marginal distribution $P(g^i)$ for the complete DBN. The motivation for the greedy
procedure is that a partial DBN with $\ell - i$ levels starting above level $i$ may provide a better model for
$P(g^i)$ than does the RBM initially associated with level $i$ itself.
The above greedy procedure is justified using a variational bound (Hinton et al., 2006). As a con-
sequence of that bound, when inserting an additional layer, if it is initialized appropriately and has
enough units, one can guarantee that initial improvements on the training criterion for the next layer