Parameter estimation for text analysis
Gregor Heinrich
Technical Note
vsonix GmbH + University of Leipzig, Germany
gregor@vsonix.com
Abstract. Presents parameter estimation methods common with discrete probability distributions, which is of particular interest in text modeling. Starting with maximum likelihood, a posteriori and Bayesian estimation, central concepts like conjugate distributions and Bayesian networks are reviewed. As an application, the model of latent Dirichlet allocation (LDA) is explained in detail with a full derivation of an approximate inference algorithm based on Gibbs sampling, including a discussion of Dirichlet hyperparameter estimation.
History: version 1: May 2005, version 2.4: August 2008.
1 Introduction
This technical note is intended to review the foundations of Bayesian parameter estimation in the discrete domain, which is necessary to understand the inner workings of topic-based text analysis approaches like probabilistic latent semantic analysis (PLSA) [Hofm99], latent Dirichlet allocation (LDA) [BNJ02] and other mixture models of count data. Despite their general acceptance in the research community, it appears that there is no common book or introductory paper that fills this role: Most known texts use examples from the Gaussian domain, where formulations appear to be rather different. Other very good introductory work on topic models (e.g., [StGr07]) skips details of algorithms and other background for clarity of presentation.
We therefore will systematically introduce the basic concepts of parameter estimation with a couple of simple examples on binary data in Section 2. We then will introduce the concept of conjugacy along with a review of the most common probability distributions needed in the text domain in Section 3. The joint presentation of conjugacy with associated real-world conjugate pairs directly justifies the choice of distributions introduced. Section 4 will introduce Bayesian networks as a graphical language to describe systems via their probabilistic models.
With these basic concepts, we present the idea of latent Dirichlet allocation (LDA) in Section 5, a flexible model to estimate the properties of text. Using the example of LDA, Gibbs sampling is shown to be a straightforward means of approximate inference in Bayesian networks. Two further aspects of LDA are treated afterwards: In Section 6, the influence of LDA hyperparameters is discussed and an estimation method proposed, and in Section 7, methods are presented to analyse LDA models for querying and evaluation.
2 Parameter estimation approaches
We face two inference problems, (1) to estimate values for a set of distribution parameters ϑ that can best explain a set of observations X and (2) to calculate the probability of new observations x̃ given previous observations, i.e., to find p(x̃|X). We will refer to the former problem as the estimation problem and to the latter as the prediction or regression problem.
The data set X ≜ {x_i}_{i=1}^{|X|} can be considered a sequence of independent and identically distributed (i.i.d.) realisations of a random variable (r.v.) X. The parameters ϑ are dependent on the distributions considered, e.g., for a Gaussian, ϑ = {µ, σ²}.
For these data and parameters, a couple of probability functions are ubiquitous in Bayesian statistics. They are best introduced as parts of Bayes' rule, which is¹:

    p(ϑ|X) = p(X|ϑ) · p(ϑ) / p(X),   (1)

and we define the corresponding terminology:

    posterior = (likelihood · prior) / evidence.   (2)
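To make the terminology concrete, here is a minimal Python sketch of Eq. 1 for a coin whose bias is restricted to a small hypothetical grid of candidate values; the grid and the observations are assumptions chosen only for illustration.

    # Illustrative sketch: Bayes' rule (Eq. 1) over a discretised parameter space.
    # The candidate values and the observations are hypothetical.
    candidates = [0.25, 0.5, 0.75]                           # possible coin biases theta
    prior = {t: 1.0 / len(candidates) for t in candidates}   # uniform prior p(theta)
    data = [1, 1, 1, 0]                                       # observed tosses (1 = heads)

    def likelihood(theta, xs):
        """p(X | theta) for i.i.d. Bernoulli observations."""
        p = 1.0
        for x in xs:
            p *= theta if x == 1 else 1.0 - theta
        return p

    # evidence p(X) = sum over theta of likelihood * prior
    evidence = sum(likelihood(t, data) * prior[t] for t in candidates)

    # posterior = likelihood * prior / evidence (Eq. 2)
    posterior = {t: likelihood(t, data) * prior[t] / evidence for t in candidates}
    print(posterior)  # posterior mass shifts towards theta = 0.75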
In the next paragraphs, we will show different estimation methods that start from simple maximisation of the likelihood, then show how prior belief on parameters can be incorporated by maximising the posterior and finally use Bayes' rule to infer a complete posterior distribution.
2.1 Maximum likelihood estimation
Maximum likelihood (ML) estimation tries to find parameters that maximise the likelihood,

    L(ϑ|X) ≜ p(X|ϑ) = p(⋂_{x∈X} {X = x} | ϑ) = ∏_{x∈X} p(x|ϑ),   (3)

i.e., the probability of the joint event that X generates the data X. Because of the product in Eq. 3, it is often simpler to use the log likelihood, ℒ ≜ log L. The ML estimation problem then can be written as:

    ϑ̂_ML = argmax_ϑ ℒ(ϑ|X) = argmax_ϑ ∑_{x∈X} log p(x|ϑ).   (4)

The common way to obtain the parameter estimates is to solve the system:

    ∂ℒ(ϑ|X)/∂ϑ_k = 0   ∀ ϑ_k ∈ ϑ.   (5)
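As an illustrative sketch (assuming numpy and scipy are available), the maximisation in Eq. 4 can also be carried out numerically on hypothetical Bernoulli data; the closed-form solution via Eq. 5 derived below is of course preferable when it exists.

    # Sketch: numerical ML estimation (Eq. 4) for hypothetical Bernoulli data.
    import numpy as np
    from scipy.optimize import minimize_scalar

    data = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])  # assumed coin tosses

    def neg_log_likelihood(p):
        # -sum_x log p(x | p) for Bernoulli observations (cf. Eq. 4)
        return -np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

    result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(result.x, data.mean())  # the numerical optimum matches the sample mean n(1)/N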
¹ Derivation: p(ϑ|X) · p(X) = p(X, ϑ) = p(X|ϑ) · p(ϑ).
The probability of a new observation x̃ given the data X can now be found using the approximation²:

    p(x̃|X) = ∫_{ϑ∈Θ} p(x̃|ϑ) p(ϑ|X) dϑ   (6)
            ≈ ∫_{ϑ∈Θ} p(x̃|ϑ̂_ML) p(ϑ|X) dϑ = p(x̃|ϑ̂_ML),   (7)

that is, the next sample is anticipated to be distributed with the estimated parameters ϑ̂_ML.
As an example, consider a set C of N Bernoulli experiments with unknown parameter p, e.g., realised by tossing a deformed coin. The Bernoulli density function for the r.v. C for one experiment is:

    p(C=c|p) = p^c (1 − p)^{1−c} ≜ Bern(c|p),   (8)

where we define c=1 for heads and c=0 for tails³.
Building an ML estimator for the parameter p can be done by expressing the (log) likelihood as a function of the data:

    ℒ = log ∏_{i=1}^{N} p(C=c_i|p) = ∑_{i=1}^{N} log p(C=c_i|p)   (9)
      = n^(1) log p(C=1|p) + n^(0) log p(C=0|p)
      = n^(1) log p + n^(0) log(1 − p),   (10)
where n^(c) is the number of times a Bernoulli experiment yielded event c. Differentiating with respect to (w.r.t.) the parameter p yields:

    ∂ℒ/∂p = n^(1)/p − n^(0)/(1 − p) = 0  ⇔  p̂_ML = n^(1) / (n^(1) + n^(0)) = n^(1)/N,   (11)
which is simply the ratio of heads results to the total number of samples. To put some numbers into the example, we could imagine that our coin is strongly deformed, and after 20 trials, we have n^(1)=12 times heads and n^(0)=8 times tails. This results in an ML estimate of p̂_ML = 12/20 = 0.6.
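A minimal Python sketch reproduces these numbers and checks that the derivative in Eq. 11 vanishes at the estimate:

    # Counts from the example: 12 heads, 8 tails in N = 20 trials.
    n1, n0 = 12, 8
    p_ml = n1 / (n1 + n0)
    print(p_ml)                  # 0.6, the closed form of Eq. 11

    # the derivative of the log likelihood vanishes at the estimate
    gradient = n1 / p_ml - n0 / (1 - p_ml)
    print(abs(gradient) < 1e-9)  # True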
2.2 Maximum a posteriori estimation
Maximum a posteriori (MAP) estimation is very similar to ML estimation but allows us to include some a priori belief on the parameters by weighting them with a prior distribution p(ϑ). The name derives from the objective to maximise the posterior of the parameters given the data:

    ϑ̂_MAP = argmax_ϑ p(ϑ|X).   (12)
² The ML estimate ϑ̂_ML is considered a constant, and the integral over the parameters given the data is the total probability that integrates to one.
³ The notation in Eq. 8 is somewhat peculiar because it makes use of the values of c to “filter” the respective parts in the density function and additionally uses these numbers to represent disjoint events.
By using Bayes' rule (Eq. 1), this can be rewritten to:

    ϑ̂_MAP = argmax_ϑ [p(X|ϑ) p(ϑ) / p(X)]          (p(X) is not a function of ϑ)
           = argmax_ϑ p(X|ϑ) p(ϑ) = argmax_ϑ {ℒ(ϑ|X) + log p(ϑ)}
           = argmax_ϑ {∑_{x∈X} log p(x|ϑ) + log p(ϑ)}.   (13)
Compared to Eq. 4, a prior distribution is added to the likelihood. In practice, the prior p(ϑ) can be used to encode extra knowledge as well as to prevent overfitting by enforcing a preference for simpler models, which is also called Occam's razor⁴.
With the incorporation of p(ϑ), MAP follows the Bayesian approach to data modelling where the parameters ϑ are thought of as r.v.s. With priors that are parametrised themselves, i.e., p(ϑ) := p(ϑ|α) with hyperparameters α, the belief in the anticipated values of ϑ can be expressed within the framework of probability⁵, and a hierarchy of parameters is created.
MAP parameter estimates can be found by maximising the term ℒ(ϑ|X) + log p(ϑ), similar to Eq. 5. Analogous to Eq. 7, the probability of a new observation, x̃, given the data, X, can be approximated using:

    p(x̃|X) ≈ ∫_{ϑ∈Θ} p(x̃|ϑ̂_MAP) p(ϑ|X) dϑ = p(x̃|ϑ̂_MAP).   (14)
Returning to the simplistic demonstration on ML, we can give an example for the
MAP estimator. Consider the above experiment, but now there are values for p that
we believe to be more likely, e.g., we believe that a coin usually is fair. This can be
expressed as a prior distribution that has a high probability around 0.5. We choose the
beta distribution:
    p(p|α, β) = (1/B(α, β)) · p^{α−1} (1 − p)^{β−1} ≜ Beta(p|α, β),   (15)

with the beta function B(α, β) = Γ(α)Γ(β) / Γ(α + β). The function Γ(x) is the Gamma function, which can be understood as a generalisation of the factorial to the domain of real numbers via the identity x! = Γ(x + 1). The beta distribution has support on the interval [0, 1] and therefore is useful to generate normalised probability values. For a graphical representation of the beta probability density function (pdf), see Fig. 1. As can be seen, with different parameters the distribution takes on quite different pdfs.
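As a small sketch (assuming scipy is available), the beta density of Eq. 15 can be evaluated directly; the Beta(5, 5) parametrisation below anticipates the fair-coin prior used in the following example.

    # Sketch: evaluating the beta density of Eq. 15 with scipy.
    from scipy.stats import beta

    prior = beta(a=5, b=5)                 # symmetric prior with mode at 0.5
    print(prior.pdf(0.5))                  # highest density at p = 0.5
    print(prior.pdf(0.1), prior.pdf(0.9))  # much lower density near the extremes
    # for alpha, beta > 1 the mode is (alpha - 1) / (alpha + beta - 2) = 4/8 = 0.5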
In our example, we believe in a fair coin and set α = β = 5, which results in a
distribution with a mode (maximum) at 0.5. The optimisation problem now becomes
⁴ Pluralitas non est ponenda sine necessitate = Plurality should not be posited without necessity. Occam's razor is also called the principle of parsimony.
⁵ Belief is not identical to probability, which is one of the reasons why Bayesian approaches are disputed by some theorists despite their practical importance.
[Fig. 1. Density functions of the beta distribution with different symmetric and asymmetric parametrisations, among them Beta(1/3, 1), Beta(2/3, 2/3), Beta(1, 1), Beta(2, 1), Beta(1, 3), Beta(2, 6), Beta(4, 4), Beta(10, 30) and Beta(20, 20); the density p(p) is plotted over p ∈ [0, 1].]
(cf. Eq. 11):

    ∂/∂p [ℒ + log p(p)] = n^(1)/p − n^(0)/(1 − p) + (α − 1)/p − (β − 1)/(1 − p) = 0   (16)

    ⇔  p̂_MAP = (n^(1) + α − 1) / (n^(1) + n^(0) + α + β − 2) = (n^(1) + 4) / (n^(1) + n^(0) + 8).   (17)
This result is interesting in two respects. The first is the changed behaviour of the estimate p̂_MAP w.r.t. the counts n^(c): their influence on the estimate is reduced by the additive values that “pull” the value towards p̂_MAP = 4/8 = 0.5, the value obtained in the absence of any observations. The higher the values of the hyperparameters α and β, the more actual observations are necessary to revise the belief expressed by them. The second interesting aspect is the exclusive appearance of the sums n^(1) + α − 1 and n^(0) + β − 1: it is irrelevant whether the counts derive from actual observations or from prior belief expressed via the hyperparameters. This is why the hyperparameters α and β are often referred to as pseudo-counts. The higher the pseudo-counts, the more sharply the beta distribution is concentrated around its maximum. Again, we observe in 20 trials n^(1)=12 times heads and n^(0)=8 times tails. This results in a MAP estimate of p̂_MAP = 16/28 ≈ 0.571, which in comparison to p̂_ML = 0.6 shows the influence of the prior belief in the “fairness” of the coin.
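The closed-form MAP estimate of Eq. 17 is equally easy to verify with a short Python sketch using the numbers from the text:

    # MAP estimate (Eq. 17) with a Beta(5, 5) prior and the observed counts.
    alpha, beta_prior = 5, 5
    n1, n0 = 12, 8

    p_map = (n1 + alpha - 1) / (n1 + n0 + alpha + beta_prior - 2)
    p_ml = n1 / (n1 + n0)
    print(round(p_map, 3), p_ml)  # 0.571 vs. 0.6: the prior pulls the estimate towards 0.5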
2.3 Bayesian estimation
Bayesian estimation extends the MAP approach by allowing a distribution over the parameter set ϑ instead of making a direct estimate. Not only does this encode the maximum (a posteriori) value of the data-generated parameters, but it also incorporates expectation as another parameter estimate as well as variance information as a measure of
[… the remaining 30 pages are not included in this extract.]