LDA Revisited: Entropy, Prior and Convergence
Jianwei Zhang¹, Jia Zeng¹²³∗, Mingxuan Yuan³∗, Weixiong Rao⁴∗ and Jianfeng Yan¹²∗
¹School of Computer Science and Technology, Soochow University, Suzhou 215006, China
²Collaborative Innovation Center of Novel Software Technology and Industrialization
³Huawei Noah’s Ark Lab, Hong Kong
⁴School of Software Engineering, Tongji University, China
∗Corresponding Authors: zeng.jia@acm.org, yuan.mingxuan@huawei.com, wxrao@tongji.edu.cn, yanjf@suda.edu.cn
ABSTRACT
Inference algorithms of latent Dirichlet allocation (LDA), either
for small or big data, can be broadly categorized into expectation-
maximization (EM), variational Bayes (VB) and collapsed Gibbs
sampling (GS). Looking for a unified understanding of these differ-
ent inference algorithms is currently an important open problem. In
this paper, we revisit these three algorithms from the entropy per-
spective, and show that EM can achieve the best predictive perplexity (a standard performance metric for LDA accuracy) by directly minimizing the cross entropy between the observed word distribution and LDA’s predictive distribution. Moreover, EM can change the entropy of LDA’s predictive distribution by tuning LDA’s priors, such as the Dirichlet hyperparameters and the number of topics, to minimize the cross entropy with the observed word distribution. Finally, we propose the adaptive EM (AEM) algorithm, which converges faster and is more accurate than the current state-of-the-art SparseLDA [20] and AliasLDA [12], on both small and big data and LDA models. The core idea is that the number of active topics,
measured by the residuals between E-steps at successive iterations,
decreases significantly, leading to the amortized O(1) time com-
plexity in terms of the number of topics. The open-source code of AEM is available on GitHub.
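The AEM algorithm itself is specified later in the paper; purely as an illustration of the active-topic idea summarized above, the hedged Python sketch below freezes, for a single word token, those topics whose E-step responsibility changed less than a tolerance between successive iterations, so that later iterations touch only a shrinking set of active topics. The function name, the `resid_tol` threshold and the boolean-mask bookkeeping are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def adaptive_e_step_token(phi_w, theta, prev_resp, active, resid_tol=1e-4):
    """Illustrative adaptive E-step update for one word token (not the paper's code).

    phi_w     : length-K topic-word parameters for this word,
    theta     : length-K document-topic parameters,
    prev_resp : length-K responsibilities from the previous iteration (sums to 1),
    active    : length-K boolean mask of topics still being updated for this token.
    Frozen topics keep their previous responsibility, so the per-token cost is
    proportional to the number of active topics rather than to K.
    """
    if not active.any():
        return prev_resp.copy(), active
    resp = prev_resp.copy()
    scores = phi_w[active] * theta[active]
    # Redistribute only the probability mass currently held by the active topics.
    resp[active] = prev_resp[active].sum() * scores / scores.sum()
    # Freeze topics whose responsibility barely moved since the last iteration.
    residual = np.abs(resp[active] - prev_resp[active])
    new_active = active.copy()
    new_active[np.flatnonzero(active)] = residual > resid_tol
    return resp, new_active
```

In practice such an update would run per token inside each document's E-step, with the mask reset occasionally so topics are not frozen prematurely; the actual residual measure and schedule used by AEM are those defined in the paper, not this sketch.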
Keywords
Latent Dirichlet allocation; entropy; adaptive EM algorithms; big
data; prior; convergence
1. INTRODUCTION
Latent Dirichlet allocation (LDA) [4] is a three-layer hierarchical
Bayesian model widely used for probabilistic topic modeling, com-
puter vision and computational biology. A collection of documents can be represented as a document-word co-occurrence matrix, where each element is the count of a word in a specific document. Modeling each document as a mixture of topics and each topic as a mixture of vocabulary words, LDA assigns thematic labels to explain the non-zero elements in the document-word matrix, segmenting the observed words into several thematic groups called topics.
From the joint probability of latent topic labels and observed words, existing inference algorithms of LDA approximately infer the posterior probability of the topic labels given the observed words, and estimate the multinomial parameters of the document-topic and topic-word distributions. From a Bayesian viewpoint, LDA adds Dirichlet prior constraints to its predecessor, probabilistic latent semantic analysis (PLSA) [10], and shows better generalization ability for predicting the words of an unseen corpus.
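As a reminder of the quantities these inference algorithms operate on, the standard LDA generative process and joint probability can be written as below; the notation (K topics, document-topic proportions θ_d, topic-word distributions φ_k, topic assignments z, words w and Dirichlet hyperparameters α, β) is the usual one from [4], restated here rather than taken from this section.

```latex
% Generative process of LDA (standard notation, as in [4]):
%   \phi_k   \sim \mathrm{Dirichlet}(\beta),      k = 1,\dots,K
%   \theta_d \sim \mathrm{Dirichlet}(\alpha),     d = 1,\dots,D
%   z_{d,n} \sim \mathrm{Multinomial}(\theta_d),  w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}})
%
% Joint probability of latent topic labels and observed words:
P(\mathbf{w},\mathbf{z},\boldsymbol{\theta},\boldsymbol{\phi}\mid\alpha,\beta)
  = \prod_{k=1}^{K} p(\phi_k\mid\beta)
    \prod_{d=1}^{D} \Big[\, p(\theta_d\mid\alpha)
    \prod_{n=1}^{N_d} p(z_{d,n}\mid\theta_d)\, p(w_{d,n}\mid\phi_{z_{d,n}}) \Big]
```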
In the past decade, the batch/online/parallel inference algorithms
of LDA for either small or big data mainly fall into three cate-
gories: 1) expectation-maximization (EM) [5], 2) variational Bayes
(VB) [4], and 3) collapsed Gibbs sampling (GS) [8]. For example, EM
for LDA has many variants such as batch EM [5, 3, 23], online
EM [24], and parallel EM [13]. Similarly, VB also has batch [4],
online [9] and parallel [25] variants. Among these inference algo-
rithms, the GS variants [8, 15, 20, 12, 1, 21, 19] have gained significantly more interest in both academia and industry because of their sampling efficiency (low space and time complexity). Unfortunately, a unified understanding of these three types of inference schemes is still lacking:
• Which algorithm can achieve the best predictive performance?
• What is the effect of prior information, such as the Dirichlet
hyperparameters and the number of topics, on the predictive
performance of these algorithms?
• Which algorithm converges the fastest to the local optimum
of the LDA objective function?
Satisfactory answers to these questions may help researchers and
engineers choose the proper inference algorithms for LDA or other
probabilistic topic models in real-world applications [3, 16, 12].
In this paper, we address these questions within the entropy frame-
work. The corpus provides the observed word distribution x. LDA can use its multinomial parameters to reconstruct the predictive word distribution x̂. Inference algorithms aim to minimize the Kullback-Leibler (KL) divergence between x and x̂ by tuning LDA’s multinomial parameters conditioned on the Dirichlet hyperparameters. First, we show that minimizing the KL divergence is equivalent to minimizing the cross entropy between x and x̂, which is the same as the definition of predictive perplexity [4, 3, 9, 23], a standard performance metric for different inference algorithms of LDA. Minimizing the cross entropy can be done directly by EM [5], unlike VB [4] and GS [8]. Therefore, EM can learn x̂ with better generalization ability on an unseen corpus than both VB and GS.
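To spell out this equivalence, the standard identity below (written in notation assumed for this illustration: x_w and x̂_w are the probabilities that x and x̂ assign to vocabulary word w, with x normalized over the held-out words) shows that minimizing the KL divergence, the cross entropy and the predictive perplexity all select the same x̂:

```latex
% KL divergence = cross entropy minus the entropy of the fixed observed distribution:
\mathrm{KL}(x\,\|\,\hat{x})
  = \sum_{w} x_w \log\frac{x_w}{\hat{x}_w}
  = \underbrace{-\sum_{w} x_w \log \hat{x}_w}_{H(x,\hat{x})\ \text{(cross entropy)}} - H(x),
\qquad
\mathrm{Perplexity} = \exp\!\big(H(x,\hat{x})\big).
% Since H(x) does not depend on \hat{x}, minimizing the KL divergence, the cross
% entropy or the perplexity over LDA's multinomial parameters are equivalent objectives.
```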
Second, as far as prior information is concerned [16], we show that tuning the Dirichlet hyperparameters can minimize the entropy of x̂, which in turn can minimize the cross entropy for the best predictive performance. Nevertheless, when the number of training