LDA Revisited: Entropy, Prior and Convergence
Jianwei Zhang¹, Jia Zeng¹²³∗, Mingxuan Yuan³∗, Weixiong Rao⁴∗ and Jianfeng Yan¹²∗
¹School of Computer Science and Technology, Soochow University, Suzhou 215006, China
²Collaborative Innovation Center of Novel Software Technology and Industrialization
³Huawei Noah’s Ark Lab, Hong Kong
⁴School of Software Engineering, Tongji University, China
∗Corresponding Authors: zeng.jia@acm.org, yuan.mingxuan@huawei.com, wxrao@tongji.edu.cn, yanjf@suda.edu.cn
ABSTRACT
Inference algorithms of latent Dirichlet allocation (LDA), either
for small or big data, can be broadly categorized into expectation-
maximization (EM), variational Bayes (VB) and collapsed Gibbs
sampling (GS). Looking for a unified understanding of these differ-
ent inference algorithms is currently an important open problem. In
this paper, we revisit these three algorithms from the entropy per-
spective, and show that EM can achieve the best predictive perplexity (a standard performance metric for LDA accuracy) by directly minimizing the cross entropy between the observed word distribution and LDA’s predictive distribution. Moreover, EM can change the entropy of LDA’s predictive distribution by tuning LDA’s priors, such as the Dirichlet hyperparameters and the number of topics, to minimize the cross entropy with the observed word distribution. Finally, we propose the adaptive EM (AEM) algorithm, which converges faster and is more accurate than the current state-of-the-art SparseLDA [20] and AliasLDA [12], on both small and big data and LDA models. The core idea is that the number of active topics,
measured by the residuals between E-steps at successive iterations,
decreases significantly, leading to the amortized O(1) time com-
plexity in terms of the number of topics. The open-source code of AEM is available on GitHub.
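The AEM algorithm itself is specified later in the paper; purely as an illustration of the active-topic idea summarized above, the hedged Python sketch below freezes, for a single word token, those topics whose E-step responsibility changed less than a tolerance between successive iterations, so that later iterations touch only a shrinking set of active topics. The function name, the `resid_tol` threshold and the boolean-mask bookkeeping are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def adaptive_e_step_token(phi_w, theta, prev_resp, active, resid_tol=1e-4):
    """Illustrative adaptive E-step update for one word token (not the paper's code).

    phi_w     : length-K topic-word parameters for this word,
    theta     : length-K document-topic parameters,
    prev_resp : length-K responsibilities from the previous iteration (sums to 1),
    active    : length-K boolean mask of topics still being updated for this token.
    Frozen topics keep their previous responsibility, so the per-token cost is
    proportional to the number of active topics rather than to K.
    """
    if not active.any():
        return prev_resp.copy(), active
    resp = prev_resp.copy()
    scores = phi_w[active] * theta[active]
    # Redistribute only the probability mass currently held by the active topics.
    resp[active] = prev_resp[active].sum() * scores / scores.sum()
    # Freeze topics whose responsibility barely moved since the last iteration.
    residual = np.abs(resp[active] - prev_resp[active])
    new_active = active.copy()
    new_active[np.flatnonzero(active)] = residual > resid_tol
    return resp, new_active
```

In practice such an update would run per token inside each document's E-step, with the mask reset occasionally so topics are not frozen prematurely; the actual residual measure and schedule used by AEM are those defined in the paper, not this sketch.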
Keywords
Latent Dirichlet allocation; entropy; adaptive EM algorithms; big
data; prior; convergence
1. INTRODUCTION
Latent Dirichlet allocation (LDA) [4] is a three-layer hierarchical
Bayesian model widely used for probabilistic topic modeling, com-
puter vision and computational biology. A collection of documents can be represented as a document-word co-occurrence matrix, where each element is the count of a word in a specific document. Modeling each document as a mixture of topics and each topic as a mixture of vocabulary words, LDA assigns thematic labels to explain the non-zero elements in the document-word matrix, segmenting the observed words into several thematic groups called topics.
From the joint probability of latent topic labels and observed words, existing inference algorithms of LDA approximately infer the posterior probability of the topic labels given the observed words, and estimate the multinomial parameters of the document-topic and topic-word distributions. From a Bayesian viewpoint, LDA adds Dirichlet prior constraints to its predecessor, probabilistic latent semantic analysis (PLSA) [10], and shows better generalization ability for predicting the words of an unseen corpus.
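As a reminder of the quantities these inference algorithms operate on, the standard LDA generative process and joint probability can be written as below; the notation (K topics, document-topic proportions θ_d, topic-word distributions φ_k, topic assignments z, words w and Dirichlet hyperparameters α, β) is the usual one from [4], restated here rather than taken from this section.

```latex
% Generative process of LDA (standard notation, as in [4]):
%   \phi_k   \sim \mathrm{Dirichlet}(\beta),      k = 1,\dots,K
%   \theta_d \sim \mathrm{Dirichlet}(\alpha),     d = 1,\dots,D
%   z_{d,n} \sim \mathrm{Multinomial}(\theta_d),  w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}})
%
% Joint probability of latent topic labels and observed words:
P(\mathbf{w},\mathbf{z},\boldsymbol{\theta},\boldsymbol{\phi}\mid\alpha,\beta)
  = \prod_{k=1}^{K} p(\phi_k\mid\beta)
    \prod_{d=1}^{D} \Big[\, p(\theta_d\mid\alpha)
    \prod_{n=1}^{N_d} p(z_{d,n}\mid\theta_d)\, p(w_{d,n}\mid\phi_{z_{d,n}}) \Big]
```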
In the past decade, the batch/online/parallel inference algorithms
of LDA for either small or big data mainly fall into three cate-
gories: 1) expectation-maximization (EM) [5], 2) variational Bayes
(VB) [4], and 3) collapsed Gibbs sampling (GS) [8]. For example, EM
for LDA has many variants such as batch EM [5, 3, 23], online
EM [24], and parallel EM [13]. Similarly, VB also has batch [4],
online [9] and parallel [25] variants. Among these inference algo-
rithms, the GS variants [8, 15, 20, 12, 1, 21, 19] have gained significantly more interest in both academia and industry because of their sampling efficiency (low space and time complexity). Unfortunately, a unified understanding of these three types of inference schemes is still lacking:
• Which algorithm can achieve the best predictive performance?
• What is the effect of prior information, such as the Dirichlet
hyperparameters and the number of topics, on the predictive
performance of these algorithms?
• Which algorithm converges the fastest to the local optimum
of the LDA objective function?
Satisfactory answers to these questions may help researchers and
engineers choose the proper inference algorithms for LDA or other
probabilistic topic models in real-world applications [3, 16, 12].
In this paper, we address these questions within the entropy frame-
work. The corpus provides the observed word distribution x. LDA can use its multinomial parameters to reconstruct the predictive word distribution x̂. Inference algorithms aim to minimize the Kullback-Leibler (KL) divergence between x and x̂ by tuning LDA’s multinomial parameters conditioned on the Dirichlet hyperparameters. First, we show that minimizing the KL divergence is equivalent to minimizing the cross entropy between x and x̂, which is the same as the definition of predictive perplexity [4, 3, 9, 23], a standard performance metric for different inference algorithms of LDA. Minimizing the cross entropy can be done directly by EM [5], unlike VB [4] and GS [8]. Therefore, EM can learn x̂ with better generalization ability on an unseen corpus than both VB and GS.
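To spell out this equivalence, the standard identity below (written in notation assumed for this illustration: x_w and x̂_w are the probabilities that x and x̂ assign to vocabulary word w, with x normalized over the held-out words) shows that minimizing the KL divergence, the cross entropy and the predictive perplexity all select the same x̂:

```latex
% KL divergence = cross entropy minus the entropy of the fixed observed distribution:
\mathrm{KL}(x\,\|\,\hat{x})
  = \sum_{w} x_w \log\frac{x_w}{\hat{x}_w}
  = \underbrace{-\sum_{w} x_w \log \hat{x}_w}_{H(x,\hat{x})\ \text{(cross entropy)}} - H(x),
\qquad
\mathrm{Perplexity} = \exp\!\big(H(x,\hat{x})\big).
% Since H(x) does not depend on \hat{x}, minimizing the KL divergence, the cross
% entropy or the perplexity over LDA's multinomial parameters are equivalent objectives.
```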
Second, as far as prior information is concerned [16], we show that tuning the Dirichlet hyperparameters can minimize the entropy of x̂, which in turn can minimize the cross entropy for the best predictive performance. Nevertheless, when the number of training