Variational Inference: Foundations and Modern Methods (NIPS 2016 Tutorial)


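Variational Inference Tutorial

Variational inference (VI) is a probabilistic inference method for approximating complex probability distributions. At NIPS 2016, David Blei, Rajesh Ranganath, and Shakir Mohamed jointly presented a tutorial on variational inference, giving researchers and students an in-depth introduction to the subject.

Foundations of variational inference

Variational inference rests on the variational principle: inference is recast as an optimization problem in which a complex target distribution, typically a posterior, is approximated by the closest member of a simpler, tractable family of distributions. Closeness is usually measured by the Kullback-Leibler (KL) divergence, which quantifies how much one probability distribution differs from another. The approximation is found by minimizing this divergence.

Advantages of variational inference include:
 * It can approximate complex probability distributions.
 * It can handle high-dimensional data.
 * It is computationally fast.

Application areas include:
 * Machine learning: parameter inference in machine learning models.
 * Image processing: image denoising and image super-resolution.
 * Natural language processing: language models and text classification.
 * Social network analysis: community detection and network structure analysis.

Modern methods

Modern variational inference methods include:
 * Black-box variational inference (Black-Box Variational Inference): estimates gradients of the variational objective by Monte Carlo sampling from the approximating distribution, so little model-specific derivation is needed.
 * Variational autoencoders (Variational Autoencoder): pair a neural-network encoder, which outputs the approximate posterior, with a generative decoder, and train both by maximizing a variational lower bound.
 * Stochastic-gradient variational inference (Variational Stochastic Gradient Descent): optimizes the variational objective with stochastic gradient descent.

Application example

The tutorial also presents practical examples, such as analyzing community structure in a social network. The authors use a network with 3.7M nodes and apply variational inference to detect its communities. The results indicate that variational inference detects community structure effectively at this scale.

Variational inference is a powerful and practical inference method that applies across machine learning, image processing, and natural language processing.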
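To make the KL divergence mentioned above concrete, here is a minimal sketch for discrete distributions. The function name `kl_divergence` and the two example distributions are illustrative assumptions, not taken from the tutorial. It shows that the divergence is zero only when the two distributions coincide and that it is not symmetric in its arguments.

```python
import math


def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions given as lists of probabilities.

    Terms with q_i == 0 contribute nothing, by the usual convention 0 * log 0 = 0.
    Assumes p_i > 0 wherever q_i > 0.
    """
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


# Hypothetical example distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

forward = kl_divergence(q, p)  # the direction minimized in variational inference
reverse = kl_divergence(p, q)  # generally a different number
print(forward, reverse, kl_divergence(p, p))
```

Variational inference minimizes the first direction, KL(q || p), with q the approximation and p the target.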

History

Figure 5 (plot: 2-4-1 XOR with random weights; x-axis: number of sweeps): outputs from the BM and MFT respectively as functions of N_sweep. For details on architecture, annealing schedule, and T_ij values, see figure 3.

Figure 6 (plot: 2-4-1 XOR with random weights; x-axis: number of sweeps): the quantity defined in equation (3.17) as a function of N_sweep. For details on architecture, annealing schedule, and T_ij values, see figure 3.

[Peterson and Anderson 1987]

Figure 22: (a) A node $S_i$ in a sigmoid belief network machine with its Markov blanket. (b) The mean field equations yield a deterministic relationship, represented in the figure with the dotted lines, between the variational parameters $\mu_i$ and $\mu_j$ for nodes $j$ in the Markov blanket of node $i$.

[...] a tractable lower bound on the log likelihood and the variational parameter $\xi_i$ can be optimized along with the other variational parameters.

Saul and Jordan (1998) show that in the limiting case of networks in which each hidden node has a large number of parents, so that a central limit theorem can be invoked, the parameter $\xi_i$ has a probabilistic interpretation as the approximate expectation of $\sigma(z_i)$, where $\sigma(\cdot)$ is again the logistic function.

For fixed values of the parameters $\xi_i$, by differentiating the KL divergence with respect to the variational parameters $\mu_i$, we obtain the following consistency equations:

$$\mu_i = \sigma\Big( \sum_j \theta_{ij}\mu_j + \sum_j \theta_{ji}(\mu_j - \xi_j) + \sum_j K_{ji} \Big) \qquad (67)$$

where $K_{ji}$ is the derivative of $-\ln \langle e^{-\xi_j z_j} + e^{(1-\xi_j) z_j} \rangle$ with respect to $\mu_i$. As Saul, et al. show, this term depends on node $i$, its child $j$, and the other parents (the "co-parents") of node $j$. Given that the first term is a sum over contributions from the parents of node $i$, and the second term is a sum over contributions from the children of node $i$, we see that the consistency equation for a given node again involves contributions from the Markov blanket of the node (see Fig. 22). Thus, as in the case of the Boltzmann machine, we find that the variational parameters are linked via their Markov blankets and the consistency equation (Eq. (67)) can be interpreted as a local message-passing algorithm.

Saul, Jaakkola, and Jordan (1996) and Saul and Jordan (1998) also show how to update the variational parameters $\xi_i$. The two papers utilize these parameters in slightly different ways and obtain different update equations. (Yet another related variational approximation for the sigmoid belief network, including both upper and lower bounds, is presented in Jaakkola and Jordan, 1996).

Finally, we can compute the gradient with respect to the parameters $\theta_{ij}$ for fixed variational parameters $\mu$ and $\xi$. The result obtained by Saul and Jordan (1998) takes the

[Jordan et al. 1999]
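The excerpt above describes the consistency equations as a local message-passing algorithm. As a minimal sketch of that idea, the code below iterates the simpler mean-field fixed point for a Boltzmann machine, $\mu_i = \sigma(\sum_j \theta_{ij}\mu_j + b_i)$, which the excerpt refers to only in passing. It is not Eq. (67) for sigmoid belief networks, which has the extra $\xi$ and $K$ terms. The function names, the three-node weights, and the biases are hypothetical values chosen for illustration.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def mean_field(theta, bias, max_sweeps=200, tol=1e-12):
    """Coordinate-wise mean-field updates for a Boltzmann machine.

    theta: symmetric weight matrix (list of lists) with a zero diagonal.
    bias:  list of per-node biases.
    Returns the variational means mu_i after the updates stop changing.
    """
    n = len(bias)
    mu = [0.5] * n  # start from the uninformative mean
    for _ in range(max_sweeps):
        largest_change = 0.0
        for i in range(n):
            # Each node only needs the current means of its neighbours.
            field = bias[i] + sum(theta[i][j] * mu[j] for j in range(n) if j != i)
            updated = sigmoid(field)
            largest_change = max(largest_change, abs(updated - mu[i]))
            mu[i] = updated
        if largest_change < tol:
            break
    return mu


# Hypothetical three-node network.
theta = [[0.0, 0.5, -0.3],
         [0.5, 0.0, 0.2],
         [-0.3, 0.2, 0.0]]
bias = [0.1, -0.2, 0.3]
mu = mean_field(theta, bias)
print(mu)
```

Each update touches only a node and its neighbours, which is what makes the scheme local.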

Figure 2: The final weights of the network. Each large block represents one hidden unit. The small black or white rectangles represent negative or positive weights with the area of a rectangle representing the magnitude of the weight. The bottom 12 rows in each block represent the incoming weights of the hidden unit. The central weight at the top of each block is the weight from the hidden unit to the linear output unit. The weight at the top-right of a block is the bias of the hidden unit.

Figure 3: The final probability distribution that is used for coding the weights. This distribution is implemented by adapting the means, variances and mixing proportions of five gaussians.

[...] is clear that the weights form three fairly sharp clusters. Figure 3 shows that the mixture of 5 Gaussians has adapted to implement the appropriate coding-prior for this weight distribution.

The performance of the network can be measured by comparing the squared error it achieves on the test data with the error that would be achieved by simply guessing the mean of the correct answers for the test data:

$$\text{Relative Error} = \frac{\sum_c (d_c - y_c)^2}{\sum_c (d_c - \bar{d})^2} \qquad (27)$$

We ran the optimization five times using different randomly chosen values for the initial means of the noisy weights. For the network that achieved the lowest value of the overall cost function, the relative error was 0.286. This compares with a relative error of 0.967 for the same network when we used noise-free weights and did not penalize their information content. The best relative error obtained using simple weight-decay with four non-linear hidden units was 0.317. This required a carefully chosen penalty coefficient for the squared weights that corresponds to $\sigma_j^2/\sigma_w^2$ in equation 4. To set this weight-decay coefficient appropriately it was necessary to try many different values on a portion of the training set and to use the remainder of the training set to decide which coefficient gave the best generalization. Once the best coefficient had been determined the whole of the training set was used with this coefficient. A lower error of 0.291 can be achieved using weight-decay if we gradually increase the weight-decay coefficient and pick the value that gives optimal performance on the test data. But this is cheating. Linear regression gave a huge relative error of 35.6 (gross overfitting) but this fell to 0.291 when we penalized the sum of the squares of the regression coefficients by an amount that was chosen to optimize performance on the test data. This is almost identical to the performance with 4 hidden units and optimal weight-decay probably because, with small weights, the hidden units operate in their central linear range, so the whole network is effectively linear.

[Hinton and van Camp 1993]
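The relative error of equation (27) in the excerpt above is straightforward to compute. A minimal sketch follows, with hypothetical target and prediction values. The function name `relative_error` is an illustrative choice and does not come from the paper.

```python
def relative_error(targets, predictions):
    """Squared error of the predictions divided by the squared error
    obtained by always guessing the mean of the targets."""
    mean_target = sum(targets) / len(targets)
    model_error = sum((d - y) ** 2 for d, y in zip(targets, predictions))
    baseline_error = sum((d - mean_target) ** 2 for d in targets)
    return model_error / baseline_error


# Hypothetical test targets.
targets = [1.0, 2.0, 3.0]

always_mean = relative_error(targets, [2.0, 2.0, 2.0])  # guessing the mean
perfect = relative_error(targets, [1.0, 2.0, 3.0])      # exact predictions
print(always_mean, perfect)
```

A value below 1 means the model beats the guess-the-mean baseline, and a value far above 1, like the 35.6 quoted for unpenalized linear regression, signals severe overfitting.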



Variational inference adapts ideas from statistical physics to probabilistic inference. Arguably, it began in the late eighties with Peterson and Anderson (1987), who used mean-field methods to fit a neural network.

This idea was picked up by Jordan's lab in the early 1990s (Tommi Jaakkola, Lawrence Saul, Zoubin Ghahramani), who generalized it to many probabilistic models. (A review paper is Jordan et al., 1999.)

In parallel, Hinton and Van Camp (1993) also developed mean-field for neural networks. Neal and Hinton (1993) connected this idea to the EM algorithm, which led to further variational methods for mixtures of experts (Waterhouse et al., 1996) and HMMs (MacKay, 1997).

Today

(a) Learned Frey Face manifold (b) Learned MNIST manifold

Figure 4: Visualisations of learned data manifold for generative models with two-dimensional latent space, learned with AEVB. Since the prior of the latent space is Gaussian, linearly spaced coordinates on the unit square were transformed through the inverse CDF of the Gaussian to produce values of the latent variables $z$. For each of these values $z$, we plotted the corresponding generative $p_\theta(x|z)$ with the learned parameters $\theta$.

(a) 2-D latent space (b) 5-D latent space (c) 10-D latent space (d) 20-D latent space

Figure 5: Random samples from learned generative models of MNIST for different dimensionalities of latent space.

B Solution of D



(z)||p

✓

(z)), Gaussian case

The variational lower bound (the objective to be maximized) contains a KL term that can often be

integrated analytically. Here we give the solution when both the prior p

✓

(z)=N(0, I) and the

posterior approximation q



(z|x

(i)

) are Gaussian. Let J be the dimensionality of z. Let µ and 

denote the variational mean and s.d. evaluated at datapoint i, and let µ

and 

simply denote the

j-th element of these vectors. Then:

✓

(z) log p(z) dz =

N(z; µ, 

) log N(z; 0, I) dz

= 

log(2⇡) 

j=1

(µ

+ 

)
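The closed-form integral above can be checked numerically. The sketch below compares the analytic expression with a Monte Carlo average of $\log \mathcal{N}(z; 0, I)$ under samples $z \sim \mathcal{N}(\mu, \sigma^2)$. The function names and the example values of $\mu$ and $\sigma$ are assumptions made for illustration only.

```python
import math
import random


def expected_log_prior(mu, sigma):
    """Analytic value of E_{N(z; mu, sigma^2)}[log N(z; 0, I)]."""
    J = len(mu)
    return -0.5 * J * math.log(2 * math.pi) - 0.5 * sum(
        m * m + s * s for m, s in zip(mu, sigma)
    )


def log_standard_normal(z):
    """Log density of a standard normal N(0, I) at the vector z."""
    return -0.5 * len(z) * math.log(2 * math.pi) - 0.5 * sum(v * v for v in z)


def monte_carlo_log_prior(mu, sigma, num_samples=20000, seed=0):
    """Average of log N(z; 0, I) over samples z = mu + sigma * eps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        total += log_standard_normal(z)
    return total / num_samples


# Hypothetical variational mean and standard deviation for J = 2.
mu = [0.5, -1.0]
sigma = [0.8, 1.2]
analytic = expected_log_prior(mu, sigma)
estimate = monte_carlo_log_prior(mu, sigma)
print(analytic, estimate)
```

The two numbers should agree up to Monte Carlo noise, which is the sense in which the term "can be integrated analytically" and need not be sampled.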

Stochastic Back-propagation in DLGMs

(a) NORB (b) CIFAR (c) Frey

Figure 4. a) Performance on the NORB dataset. Left: Samples from the training data. Right: sampled pixel means from the model. b) Performance on CIFAR10 patches. Left: Samples from the training data. Right: Sampled pixel means from the model. c) Frey faces data. Left: data samples. Right: model samples.

Figure 5. Imputation results on MNIST digits. The first column shows the true data. Column 2 shows pixel locations set as missing in grey. The remaining columns show imputations and denoising of the images for 15 iterations, starting left to right. Top: 60% missingness. Middle: 80% missingness. Bottom: 5x5 patch missing.

[...]matics and experimental design. We show the ability of the model to impute missing data using the MNIST data set in figure 5. We test the imputation ability under two different missingness types (Little & Rubin, 1987): Missing-at-random (MAR), where we consider 60% and 80% of the pixels to be missing randomly, and Not Missing-at-random (NMAR), where we consider a square region of the image to be missing. The model produces very good completions in both test cases. There is uncertainty in the identity of the image. This is expected and reflected in the errors in these completions as the resampling procedure is run, and further demonstrates the ability of the model to capture the diversity of the underlying data. We do not integrate over the missing values in our imputation procedure, but use a procedure that simulates a Markov chain that we show converges to the true marginal distribution. The procedure to sample from the missing pixels given the observed pixels is explained in appendix E.

Figure 6. Two dimensional embedding of the MNIST data set. Each colour corresponds to one of the digit classes.

6.5. Data Visualisation

Latent variable models such as DLGMs are often used for visualisation of high-dimensional data sets. We project the MNIST data set to a 2-dimensional latent space and use this 2-D embedding as a visualisation of the data. A 2-dimensional embedding of the MNIST data set is shown in figure 6. The classes separate into different regions indicating that such a tool can be useful in gaining insight into the structure of high-dimensional data sets.

7. Discussion

Our algorithm generalises to a large class of models with continuous latent variables, which include Gaussian, non-negative or sparsity-promoting latent variables. For models with discrete latent variables (e.g., sigmoid belief networks), policy-gradient approaches that improve upon the REINFORCE approach remain the most general, but intelligent design is needed to control the gradient-variance in high dimensional settings.

These models are typically used with a large number

[Kingma and Welling 2013] [Rezende et al. 2014]


data {
  int N;     // number of observations
  int x[N];  // discrete-valued observations
}
parameters {
  // latent variable, must be positive
  real<lower=0> theta;
}
model {
  // non-conjugate prior for latent variable
  theta ~ weibull(1.5, 1);
  // likelihood
  for (n in 1:N)
    x[n] ~ poisson(theta);
}

Figure 2: Specifying a simple nonconjugate probability model in Stan.

[...] analysis posits a prior density $p(\theta)$ on the latent variables. Combining the likelihood with the prior gives the joint density $p(X, \theta) = p(X \mid \theta)\, p(\theta)$.

We focus on approximate inference for differentiable probability models. These models have continuous latent variables $\theta$. They also have a gradient of the log-joint with respect to the latent variables, $\nabla_\theta \log p(X, \theta)$. The gradient is valid within the support of the prior, $\mathrm{supp}(p(\theta)) = \{\theta \mid \theta \in \mathbb{R}^K \text{ and } p(\theta) > 0\} \subseteq \mathbb{R}^K$, where $K$ is the dimension of the latent variable space. This support set is important: it determines the support of the posterior density and plays a key role later in the paper. We make no assumptions about conjugacy, either full or conditional.

For example, consider a model that contains a Poisson likelihood with unknown rate, $p(x \mid \theta)$. The observed variable $x$ is discrete; the latent rate $\theta$ is continuous and positive. Place a Weibull prior on $\theta$, defined over the positive real numbers. The resulting joint density describes a nonconjugate differentiable probability model. (See Figure 2.) Its partial derivative $\partial/\partial\theta\; p(x, \theta)$ is valid within the support of the Weibull distribution, $\mathrm{supp}(p(\theta)) = \mathbb{R}^{+} \subset \mathbb{R}$. Because this model is nonconjugate, the posterior is not a Weibull distribution. This presents a challenge for classical variational inference. In Section 2.3, we will see how ADVI handles this model.

Many machine learning models are differentiable. For example: linear and logistic regression, matrix factorization with continuous or discrete measurements, linear dynamical systems, and Gaussian processes. Mixture models, hidden Markov models, and topic models have discrete random variables. Marginalizing out these discrete variables renders these models differentiable. (We show an example in Section 3.3.) However, marginalization is not tractable for all models, such as the Ising model, sigmoid belief networks, and (untruncated) Bayesian nonparametric models.

2.2 Variational Inference

Bayesian inference requires the posterior density $p(\theta \mid X)$, which describes how the latent variables vary when conditioned on a set of observations $X$. Many posterior densities are intractable because their normalization constants lack closed forms. Thus, we seek to approximate the posterior.

Consider an approximating density $q(\theta; \phi)$ parameterized by $\phi$. We make no assumptions about its shape or support. We want to find the parameters of $q(\theta; \phi)$ to best match the posterior according to some loss function. Variational inference (VI) minimizes the Kullback-Leibler (KL) divergence from the approximation to the posterior [2],

$$\phi^{*} = \arg\min_{\phi} \mathrm{KL}\big(q(\theta; \phi) \,\|\, p(\theta \mid X)\big). \qquad (1)$$

Typically the KL divergence also lacks a closed form. Instead we maximize the evidence lower bound (ELBO), a proxy to the KL divergence,

$$\mathcal{L}(\phi) = \mathbb{E}_{q(\theta)}\big[\log p(X, \theta)\big] - \mathbb{E}_{q(\theta)}\big[\log q(\theta; \phi)\big].$$

The first term is an expectation of the joint density under the approximation, and the second is the entropy of the variational density. Maximizing the ELBO minimizes the KL divergence [1, 16].

The posterior of a fully conjugate model is in the same family as the prior; a conditionally conjugate model

has this property within the complete conditionals of the model [3].

[Kucukelbir et al. 2015]
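As a rough illustration of the ELBO for the Poisson-Weibull model of Figure 2, the sketch below estimates it by Monte Carlo and compares it with the log evidence obtained by brute-force numerical integration over the single latent rate. Everything beyond the model itself is an assumption for illustration: the observed counts are hypothetical, the approximating family is taken to be log-normal with parameters `m` and `s`, and the function names are invented here. It is not the procedure of the excerpted paper.

```python
import math
import random

X = [2, 1, 3, 0, 2]  # hypothetical observed counts


def log_joint(theta, x=X):
    """log p(x, theta): Weibull(shape 1.5, scale 1) prior, Poisson likelihood."""
    log_prior = math.log(1.5) + 0.5 * math.log(theta) - theta ** 1.5
    log_lik = sum(xn * math.log(theta) - theta - math.lgamma(xn + 1) for xn in x)
    return log_prior + log_lik


def log_q(theta, m, s):
    """Log density of a log-normal approximation: log(theta) ~ N(m, s^2)."""
    z = (math.log(theta) - m) / s
    return -math.log(theta) - math.log(s) - 0.5 * math.log(2 * math.pi) - 0.5 * z * z


def elbo_estimate(m, s, num_samples=4000, seed=0):
    """Monte Carlo estimate of E_q[log p(x, theta)] - E_q[log q(theta)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        theta = math.exp(m + s * rng.gauss(0.0, 1.0))  # draw theta from q
        total += log_joint(theta) - log_q(theta, m, s)
    return total / num_samples


def log_evidence(upper=20.0, bins=20000):
    """log p(x) by midpoint-rule integration of the joint over theta in (0, upper)."""
    width = upper / bins
    area = sum(math.exp(log_joint((i + 0.5) * width)) for i in range(bins)) * width
    return math.log(area)


well_placed = elbo_estimate(m=0.3, s=0.3)    # q concentrated near the posterior mass
badly_placed = elbo_estimate(m=2.0, s=1.0)   # q far from the posterior mass
evidence = log_evidence()
print(well_placed, badly_placed, evidence)
```

The gap between the log evidence and the ELBO is the KL divergence from the approximation to the posterior, so the better-placed approximation yields the larger ELBO while staying below the log evidence.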



There is now a flurry of new work on variational inference, making it scalable, easier to derive, faster, more accurate, and applying it to more complicated models and applications.

Modern VI touches many important areas: probabilistic programming, reinforcement learning, neural networks, convex optimization, Bayesian statistics, and myriad applications.

Our goal today is to teach you the basics, explain some of the newer ideas, and to suggest open areas of new research.
