Study on Topic Tree-based Topic Structure Modeling
Yinfei Huang 1,a,*, Qian Chen 2,b, Shuhan Yuan 3,c, Dongdong Lv 3,d, Qi Zhang 4,3,e
1 Shanghai Stock Exchange, Shanghai, China
2 School of Computer and Information Technology, Shanxi University, Taiyuan, China
3 Department of Computer Science and Technology, Tongji University, Shanghai, China
4 Shenhua Helishi Information Technology Co., Ltd., Beijing, China
a yfhuang@sse.com.cn, b chenqian@sxu.edu.cn, c bookcold@163.com, d lvdodo0355@126.com, e zhangq@shenhua.cc
* Corresponding author
Keywords: Text stream, topic tree, topic structure modeling.
Abstract. A topic tree-based topic structure model is proposed, built on the five-tuple
representation of ontology and on probability theory. Words in the vocabulary are represented by the
leaf nodes of the topic tree. Simulation experiments on a real news corpus show that, compared with
flat topic structure models, topic similarity based on sym-KL divergence constructs the topic tree
more accurately and mines the latent temporal and spatial semantic characteristics of topics in the
text stream more deeply.
Introduction
Topic detection in text streams includes text modeling, topic identification, topic tracking,
topic evolution and topic trend detection. Technically, topic identification in a text stream depends
largely on text modeling and on parameter inference for the text model [1]. The earliest approaches
used the vector space model for text modeling. Later, the classic vector space model with tf-idf
(term frequency and inverse document frequency) weighting was replaced by probabilistic topic
models. Combining probabilistic topic models with other methods such as matrix decomposition
brought a breakthrough in the processing of document collections. In recent years, Bayesian
nonparametric models, which have made great theoretical progress, have been adopted for topic
detection in text streams, making incremental topic clustering and tracking possible [2,3].
In this paper, topic structure modeling is combined with the study of topic detection and
evolution on the basis of the five-tuple theory of ontology, and a topic tree-based topic structure
model is put forward. The experimental results show that, compared with simple topic structure
models, topic similarity based on sym-KL divergence constructs a more accurate topic tree and mines
more deeply the latent temporal and spatial semantic characteristics of topics in the text stream.
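This section names sym-KL divergence as the topic similarity measure but does not give its formula. The sketch below assumes one common symmetrization, KL(p||q) + KL(q||p), applied to two topics represented as word distributions over the same vocabulary. The function names, the smoothing constant eps and the sample distributions are illustrative assumptions, not taken from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.

    eps is an assumed smoothing floor that avoids division by zero
    when q assigns zero probability to a word.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:  # terms with p_i = 0 contribute nothing
            total += pi * math.log(pi / max(qi, eps))
    return total


def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence, assumed here to be KL(p||q) + KL(q||p)."""
    return kl_divergence(p, q, eps) + kl_divergence(q, p, eps)


# Hypothetical topic-word distributions over a three-word vocabulary.
topic_a = [0.7, 0.2, 0.1]
topic_b = [0.1, 0.2, 0.7]

print(sym_kl(topic_a, topic_a))  # identical topics
print(sym_kl(topic_a, topic_b))  # dissimilar topics give a larger value
```

A smaller value indicates more similar topics. Other symmetrizations, such as averaging the two directions, differ only by a constant factor.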
Semantic-based Topic Structure Model: Topic Tree
Gruber defines an ontology as "an explicit, formal specification of a shared conceptual model" [4],
and this definition is widely accepted. It implies four aspects of an ontology: it is a
conceptualization, and it is explicit, formal and shared.
Generally, an ontology may be represented as a five-tuple O = {C, R, P, I, A}, where C is the
set of concepts; R is the set of relations between concepts; P is the set of concept attributes,
each recording the concept, the attribute name and the attribute value; I is the set of instances
of the concepts; and A is the set of axioms of the ontology [5].
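As a rough illustration, the five-tuple can be sketched as a small data structure. The field layouts (relations as triples, axioms as predicates over the ontology) and the example concepts are assumptions made for this sketch, not the authors' representation.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Minimal sketch of the five-tuple O = {C, R, P, I, A}."""
    concepts: set = field(default_factory=set)      # C: concept names
    relations: set = field(default_factory=set)     # R: (source, relation, target) triples
    attributes: dict = field(default_factory=dict)  # P: concept -> {attribute name: value}
    instances: dict = field(default_factory=dict)   # I: concept -> set of instances
    axioms: list = field(default_factory=list)      # A: callables Ontology -> bool

    def check_axioms(self):
        """Return True if every axiom holds for this ontology."""
        return all(axiom(self) for axiom in self.axioms)


def relations_use_known_concepts(onto):
    """Hypothetical axiom: every relation links two concepts in C."""
    return all(src in onto.concepts and dst in onto.concepts
               for src, _, dst in onto.relations)


# Hypothetical news-domain ontology.
onto = Ontology(
    concepts={"economy", "stock market"},
    relations={("stock market", "subtopic_of", "economy")},
    attributes={"stock market": {"document_count": 120}},
    instances={"stock market": {"news_0001"}},
    axioms=[relations_use_known_concepts],
)
print(onto.check_axioms())
```

Modeling axioms as predicates keeps the sketch short; a real ontology language would express them declaratively.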
Definition 1: Concept list. A concept list is the list of all concepts in the concept set of the
ontology of a specific corpus, arranged in sequence by index.
Definition 2: Topic attribute. A topic attribute is an object that represents a statistical
characteristic of a topic in a given domain; it has a name and a value.
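The two definitions can be sketched as follows. The alphabetical ordering of the concept list, the attribute name document_count and the sample concepts are assumptions for illustration; the definitions only require that concepts are indexed in sequence and that an attribute has a name and a value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicAttribute:
    """Definition 2: a named statistical characteristic of a topic."""
    name: str
    value: float


def build_concept_list(concepts):
    """Definition 1: all concepts of the concept set in a fixed sequence.

    Sorting alphabetically is an assumed way to fix the order; the
    position of a concept in the returned list serves as its index.
    """
    return sorted(set(concepts))


# Hypothetical concept set of a news corpus.
concept_list = build_concept_list({"sports", "economy", "politics"})
concept_index = {concept: i for i, concept in enumerate(concept_list)}

# Hypothetical attribute: number of documents assigned to a topic.
attr = TopicAttribute(name="document_count", value=42.0)

print(concept_list)
print(concept_index["politics"])
print(attr)
```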
Applied Mechanics and Materials Vols. 687-691 (2014) pp 1320-1323
Submitted: 26.09.2014; Accepted: 27.09.2014
© (2014) Trans Tech Publications, Switzerland
doi:10.4028/www.scientific.net/AMM.687-691.1320