S under the MRF distribution, i.e., training corresponds to finding the parameters θ that maximize
the likelihood given the training data. The likelihood $\mathcal{L} : \Theta \to \mathbb{R}$ of an MRF given the data set S maps
parameters θ from a parameter space Θ to $\mathcal{L}(\theta \,|\, S) = \prod_{i=1}^{\ell} p(\boldsymbol{x}_i \,|\, \theta)$. Maximizing the likelihood is the
same as maximizing the log-likelihood given by
\begin{equation}
\ln \mathcal{L}(\theta \,|\, S) = \ln \prod_{i=1}^{\ell} p(\boldsymbol{x}_i \,|\, \theta) = \sum_{i=1}^{\ell} \ln p(\boldsymbol{x}_i \,|\, \theta) \; . \tag{4}
\end{equation}
For the Gibbs distribution of an MRF, it is in general not possible to find the maximum likelihood
parameters analytically. Thus, numerical approximations have to be used, for example gradient ascent,
which is described below.
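For concreteness, the following NumPy sketch (a toy example of ours, not part of the original text) evaluates the log-likelihood (4) for a small Gibbs distribution $p(\boldsymbol{x} \,|\, \theta) \propto \exp(\theta^\top \boldsymbol{x})$ over binary vectors; with only a handful of states the partition function can be computed exactly, which is precisely what becomes intractable for realistic MRFs:

```python
import itertools
import numpy as np

def log_p(x, theta):
    """Log-probability of a binary vector x under the toy Gibbs
    distribution p(x | theta) = exp(theta . x) / Z(theta)."""
    states = np.array(list(itertools.product([0, 1], repeat=len(theta))))
    log_z = np.log(np.sum(np.exp(states @ theta)))  # exact partition function
    return float(x @ theta - log_z)

def log_likelihood(S, theta):
    """Log-likelihood (4): the sum of log-probabilities of the samples."""
    return sum(log_p(x, theta) for x in S)

theta = np.array([0.5, -0.2, 0.1])               # toy parameters
S = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1]])  # toy "training set"
print(log_likelihood(S, theta))
```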
Maximizing the likelihood corresponds to minimizing the distance between the unknown distribution
q underlying S and the distribution p of the MRF in terms of the Kullback–Leibler divergence
(KL divergence), which for a finite state space Ω is given by
\begin{equation}
\mathrm{KL}(q \,\|\, p) = \sum_{x \in \Omega} q(x) \ln \frac{q(x)}{p(x)} = \sum_{x \in \Omega} q(x) \ln q(x) - \sum_{x \in \Omega} q(x) \ln p(x) \; . \tag{5}
\end{equation}
The KL divergence is a (non-symmetric) measure of the difference between two distributions. It is
always non-negative, and it is zero if and only if the distributions are the same. As becomes clear
from equation (5), the KL divergence can be expressed as the difference between the entropy of q and a
second term. Only the latter depends on the parameters subject to optimization. Approximating the
expectation over q in this term by the training samples from q results in the log-likelihood (up to the
constant factor 1/ℓ). Therefore, maximizing the log-likelihood corresponds to minimizing the KL divergence.
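The decomposition in equation (5) is easy to verify numerically. The sketch below (with toy probability vectors of our choosing) computes both sides and illustrates how an average over samples drawn from q approximates the parameter-dependent second term:

```python
import numpy as np

# Toy probability vectors over a finite state space Omega (three states).
q = np.array([0.5, 0.25, 0.25])   # unknown distribution underlying the data
p = np.array([0.4, 0.4, 0.2])     # model distribution

kl = np.sum(q * np.log(q / p))          # definition in (5)
neg_entropy = np.sum(q * np.log(q))     # first term, independent of theta
cross_term = np.sum(q * np.log(p))      # second term, depends on theta via p
assert np.isclose(kl, neg_entropy - cross_term)

# Replacing the expectation over q by an average over samples drawn from q
# approximates the second term by the mean log-likelihood of the sample.
rng = np.random.default_rng(0)
samples = rng.choice(len(q), size=100_000, p=q)
print(np.mean(np.log(p[samples])), cross_term)  # close to each other
```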
Optimization by gradient ascent. If it is not possible to find parameters maximizing the likelihood
analytically, the usual way to find them is by gradient ascent on the log-likelihood. This corresponds
to iteratively updating the parameters $\theta^{(t)}$ to $\theta^{(t+1)}$ based on the gradient of the log-likelihood. Let
us consider the following update rule:
\begin{equation}
\theta^{(t+1)} = \theta^{(t)} + \underbrace{\eta \, \frac{\partial}{\partial \theta^{(t)}} \ln \mathcal{L}\bigl(\theta^{(t)} \,\big|\, S\bigr) - \lambda \theta^{(t)} + \nu \Delta\theta^{(t-1)}}_{=\, \Delta\theta^{(t)}} \tag{6}
\end{equation}
If the constants $\lambda \in \mathbb{R}_0^+$ and $\nu \in \mathbb{R}_0^+$ are set to zero, we have vanilla gradient ascent. The constant
$\eta \in \mathbb{R}^+$ is the learning rate. As we will see later, it can be desirable to strive for models with weights
having small absolute values. To achieve this, we can optimize an objective function consisting of the
log-likelihood minus half of the squared norm of the parameters, $\|\theta\|^2/2$, weighted by λ. This method is called
weight decay, and it penalizes weights with large magnitude. It leads to the $-\lambda\theta^{(t)}$ term in our update
rule (6). In a Bayesian framework, weight decay can be interpreted as assuming a zero-mean Gaussian
prior on the parameters. The update rule can be further extended by a momentum term, $\Delta\theta^{(t-1)}$,
weighted by the parameter ν. Using a momentum term helps against oscillations in the iterative update
procedure and can speed up the learning process, as is seen in feed-forward neural network training
[43].
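Translated into code, one iteration of (6) is a one-liner. The sketch below is our own illustration: the gradient of an MRF log-likelihood (the subject of the following sections) is replaced by a trivially computable Gaussian-mean example, so that the roles of η, λ, and ν can be seen in isolation:

```python
import numpy as np

def update(theta, delta_prev, grad_loglik, eta, lam=0.0, nu=0.0):
    """One step of update rule (6): gradient ascent with optional
    weight decay (lam > 0) and momentum (nu > 0)."""
    delta = eta * grad_loglik(theta) - lam * theta + nu * delta_prev
    return theta + delta, delta

# Toy usage: fit the mean of a unit-variance Gaussian, where the gradient
# of the log-likelihood is sum_i (x_i - theta).
rng = np.random.default_rng(0)
S = rng.normal(loc=[1.0, -2.0], scale=1.0, size=(100, 2))
grad = lambda theta: np.sum(S - theta, axis=0)

theta, delta = np.zeros(2), np.zeros(2)
for _ in range(200):
    theta, delta = update(theta, delta, grad, eta=0.005, lam=0.01, nu=0.5)
print(theta)  # close to the sample mean, shrunk slightly by weight decay
```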
Introducing latent variables. Suppose we want to model an m-dimensional unknown probability
distribution q (e.g., each component of a sample corresponds to one of m pixels of an image). Typically,
not all the variables $X = (X_v)_{v \in V}$ in an MRF need to correspond to some observed component,
and the number of nodes is larger than m. We split $X$ into visible (or observed) variables $V = (V_1, \ldots, V_m)$ corresponding to the components of the observations and latent (or hidden) variables $H =$