Vol. 35, No. 12    ACTA AUTOMATICA SINICA    December, 2009
Topic Analysis Based on LDA Model
SHI Jing¹  FAN Meng²  LI Wan-Long¹,³
Abstract  Building on text segmentation, the topic of each segment is identified and the central topic of the whole text is then summarized, so that the topical thread of the text is made visible. Topics are represented as strings of words. For accurate analysis, LDA (Latent Dirichlet allocation) is used to model the corpus and the text, Clarity is taken as the measure of similarity between blocks, and segment boundaries are identified at local minima. The topic words of each segment are extracted according to the Shannon information of the words. Through clustering of background words and association of topic words, the topic words are extended beyond the text under analysis, in an attempt to uncover the meaning hidden beneath the surface of the words. Experiments show that the analysis results are clearly better than those of other methods and can provide valuable preprocessing for subsequent text reasoning.
Key words  Topic analysis, latent Dirichlet allocation (LDA) model, text segmentation, Gibbs sampling
CLC number  TP301
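Topic analysis of a text aims to determine its topic structure, that is, to identify the topics discussed, delimit the extent of each topic, track shifts between topics, and detect the relations among topics. The results are of great value to fields such as information extraction, automatic summarization, and text classification. The depth of topic analysis varies with the application. Shallow analysis only determines topic boundaries (text segmentation) [1−2], or goes on to indicate the relations between segments (whether they discuss the same topic) [3]; more complex analysis can discuss the content of the topics on the basis of the identified boundaries [4]. As preprocessing for text reasoning, this paper studies how to carry out boundary computation and topic representation in a unified way within the framework of the LDA (Latent Dirichlet allocation) model.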
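As a rough illustration of the boundary computation mentioned above, the following Python sketch marks a segment boundary wherever the similarity between adjacent blocks reaches a local minimum. It is a minimal sketch under stated assumptions, not the method of this paper: each block is assumed to be already described by a topic distribution (for example, one inferred by a trained LDA model), cosine similarity is used as a stand-in for the Clarity measure, and the names `cosine` and `find_boundaries` are hypothetical.

```python
import math


def cosine(p, q):
    """Cosine similarity between two topic distributions (stand-in for Clarity)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def find_boundaries(block_topics, similarity=cosine):
    """Return (similarities, cuts) for a sequence of block topic distributions.

    similarities[i] compares block i with block i + 1.
    A cut index c means that a new segment starts at block c.
    """
    sims = [
        similarity(block_topics[i], block_topics[i + 1])
        for i in range(len(block_topics) - 1)
    ]
    cuts = []
    # Interior points only: a local minimum needs a neighbor on each side.
    for i in range(1, len(sims) - 1):
        if sims[i] < sims[i - 1] and sims[i] < sims[i + 1]:
            cuts.append(i + 1)
    return sims, cuts


# Hypothetical two-topic distributions for four consecutive blocks.
blocks = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
sims, cuts = find_boundaries(blocks)
print(cuts)
```

In this toy input the first two blocks lean toward one topic and the last two toward the other, so the only local minimum of similarity lies between blocks 1 and 2. Any other block similarity can be passed through the `similarity` parameter.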
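To analyze text by statistical methods, an appropriate model must first be chosen. Reference [4] represents the distribution of words in a text with a finite mixture model that carries no additional statistical assumptions and trains it directly with EM (Expectation maximization); the resulting problem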
Received July 16, 2008; in revised form March 25, 2009
Supported by the Doctoral Program of Changchun University of Technology (2008A02)
1. College of Computer Science and Engineering, Changchun University of Technology, Changchun 130012  2. Department of Science and Research Administration, Changchun University of Technology, Changchun 130012  3. College of Computer Science and Technology, Jilin University, Changchun 130012
DOI: 10.3724/SP.J.1004.2009.01586