2.1 Probabilistic inference with a fully-specified model
\brain injury" indicates that the former is not relevant when specifying the conditional
probability of \brain injury" given the variables preceding it. For the model to b e fully
specied, this graphical structure must, of course, b e accompanied by actual numerical
values for the relevant conditional probabilities, or for parameters that determine these.
The diseases in the middle layer of this b elief network are mostly latentvariables, invented
byphysicians to explain patterns of symptoms they have observed in patients. The symp-
toms in the bottom layer and the underlying causes in the top layer would generally b e
considered observable. Neither classication is unambiguous | one might consider micro-
scopic observation of a pathogenic microorganism as a direct observation of a disease, and,
on the other hand, \fever" could be considered a latentvariable invented to explain why
some patients have consistently high thermometer readings.
In any case, many of the variables in such a network will not, in fact, have been observed, and inference will require a summation over all possible combinations of values for these unobserved variables, as in equation (2.7). To find the probability that a patient with certain symptoms has cholera, for example, we must sum over all possible combinations of other diseases the patient may have as well, and over all possible combinations of underlying causes. For a complex network, the number of such combinations will be enormous. For some networks with sparse connectivity, exact numerical methods are nevertheless feasible (Pearl, 1988; Lauritzen and Spiegelhalter, 1988). For general networks, Markov chain Monte Carlo methods are an attractive approach to handling the computational difficulties (Pearl, 1987).
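To make the summation concrete, the following sketch performs exact inference by enumeration in a toy three-layer network of the kind described above. The variables, structure, and probabilities here are hypothetical, chosen only for illustration; they are not taken from the network of the text.

```python
# A minimal sketch of exact inference by enumeration in a hypothetical
# three-layer belief network (cause -> diseases -> symptoms), with made-up
# conditional probabilities; all variables are binary.
from itertools import product

VARS = ["water", "cholera", "flu", "diarrhea", "fever"]

def joint(a):
    """Joint probability of a full assignment: the product of each
    variable's conditional probability given its parents."""
    p = 0.10 if a["water"] else 0.90                     # P(contaminated water)
    if a["cholera"]:
        p *= 0.30 if a["water"] else 0.01                # P(cholera | water)
    else:
        p *= 0.70 if a["water"] else 0.99
    p *= 0.10 if a["flu"] else 0.90                      # P(flu)
    p_d = 0.90 if a["cholera"] else 0.05                 # P(diarrhea | cholera)
    p *= p_d if a["diarrhea"] else 1.0 - p_d
    p_f = 0.80 if (a["cholera"] or a["flu"]) else 0.02   # P(fever | cholera, flu)
    p *= p_f if a["fever"] else 1.0 - p_f
    return p

def prob(query, evidence):
    """P(query | evidence), summing the joint over all combinations of
    values for the unobserved variables, as in equation (2.7)."""
    num = den = 0.0
    for values in product([False, True], repeat=len(VARS)):
        a = dict(zip(VARS, values))
        if any(a[v] != val for v, val in evidence.items()):
            continue                          # contradicts the observations
        p = joint(a)
        den += p
        if all(a[v] == val for v, val in query.items()):
            num += p
    return num / den

# Probability that a patient with diarrhea and fever has cholera:
print(prob({"cholera": True}, {"diarrhea": True, "fever": True}))
```

The loop visits every combination of values, so its cost grows as two to the power of the number of unobserved variables; this exponential growth is precisely the computational difficulty that the sparse-network methods and Markov chain Monte Carlo methods cited above are meant to address.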
Example: Multi-layer perceptrons. The most widely-used class of "neural networks" are the multi-layer perceptron (or backpropagation) networks (Rumelhart, Hinton, and Williams, 1986). These networks can be viewed as modeling the conditional distributions for an output vector, Y, given the various possible values of an input vector, X. The marginal distribution of X is not modeled, so these networks are suitable only for regression or classification applications, not (directly, at least) for applications where the full joint distribution of the observed variables is required. Multi-layer perceptrons have been applied to a great variety of problems. Perhaps the most typical sorts of application take as input sensory information of some type and from that predict some characteristic of what is sensed. (Thodberg (1993), for example, predicts the fat content of meat from spectral information.)
Multi-layer perceptrons are almost always viewed as non-parametric models. They can have a variety of architectures, in which "input", "output", and "hidden" units are arranged and connected in various fashions, with the particular architecture (or several candidate architectures) being chosen by the designer to fit the characteristics of the problem. A simple and common arrangement is to have a layer of input units, which connect to a layer of hidden units, which in turn connect to a layer of output units. Such a network is shown in Figure 2.3. Architectures with more layers, selective connectivity, shared weights on connections, or other elaborations are also used.
The network of Figure 2.3 operates as follows. First, the input units are set to their observed values, $x = \{x_1, \ldots, x_m\}$. Values for the hidden units, $h = \{h_1, \ldots, h_p\}$, and for the output units, $o = \{o_1, \ldots, o_n\}$, are then computed as functions of $x$ as follows:
\[ h_k(x) \;=\; f\Big(u_{k0} + \textstyle\sum_j u_{kj}\, x_j\Big) \tag{2.17} \]
\[ o_l(x) \;=\; g\Big(v_{l0} + \textstyle\sum_k v_{lk}\, h_k(x)\Big) \tag{2.18} \]
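As a concrete illustration, the following sketch computes the hidden and output unit values of equations (2.17) and (2.18) for a single input vector. The activation functions f and g are not fixed by the equations; tanh for the hidden units and the identity for the output units are assumed here, a common choice for regression networks.

```python
# A minimal sketch of the forward computation in equations (2.17)-(2.18)
# for the network of Figure 2.3, with assumed activations f = tanh and
# g = identity.
import numpy as np

def forward(x, u0, u, v0, v):
    """Compute hidden values h and output values o from inputs x.

    x  : inputs x_j, shape (m,)
    u0 : hidden-unit biases u_{k0}, shape (p,)
    u  : input-to-hidden weights u_{kj}, shape (p, m)
    v0 : output-unit biases v_{l0}, shape (n,)
    v  : hidden-to-output weights v_{lk}, shape (n, p)
    """
    h = np.tanh(u0 + u @ x)   # equation (2.17), with f = tanh
    o = v0 + v @ h            # equation (2.18), with g = identity
    return h, o

# Example with m = 3 inputs, p = 4 hidden units, n = 2 outputs,
# using randomly drawn weights:
rng = np.random.default_rng(0)
m, p, n = 3, 4, 2
h, o = forward(rng.normal(size=m),
               rng.normal(size=p), rng.normal(size=(p, m)),
               rng.normal(size=n), rng.normal(size=(n, p)))
print(h, o)
```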