Discovery of Rare Sequential Topic Patterns in Document Stream
∗
Zhongyi Hu
‡
Hongan Wang
†‡
Jiaqi Zhu
‡¶
Maozhen Li
§
Ying Qiao
‡
Changzhi Deng
‡
Abstract
Plain text documents created and distributed on the
Internet are ever changing in various forms. Mining topics
of these documents has significant applications in many do-
mains. Most of the literature is devoted to top ic modeling,
while sequential patterns of topics in document streams are
ignored. Moreover, traditional sequential pattern mining al-
gorithms mainly focused on frequ ent patterns for determin-
istic data sets, and thus not suitable for document streams
with topic uncertainty and rare patterns. In this paper, we
formulate and handle the mining problem of rare Sequen-
tial Topic Patterns (STPs) for Internet document streams,
which are rare on th e whole but relatively often for specific
users, so also interesting. Since this type of rare STPs re-
flects users’ specific behaviors, our work can be applied in
many fields, such as personalized context-aware recommen-
dation and real-time monitoring on abnormal user behaviors
on the Internet. We propose a novel approach to discover-
ing user-related rare STPs based on the temporal and prob-
abilistic information of concerned topics. After extracting
topics from docu ments by LDA and sorting the document
stream into sessions for different users during different time
periods, the proposed algorithms discover rare STPs by ( 1)
mining STP candidates for each user through an efficient al-
gorithm based on pattern-growth, and (2) generating user-
related rare STPs by pattern rarity analysis. Experiments
on both sy nthetic and real data sets show that our approach
can discover interesting rare STPs very effectively and effi-
ciently.
∗
This work is supported by the National Key Basic Re-
search Program of China (973 Program, No.2013CB329305),
the State Key Program of National Natural Science Founda-
tion of China (NSFC-61232013), National Natural Science Foun-
dation of China (NSFC-61202217), the National High Technol-
ogy Research and Development Program of China (863 Program,
No.2012AA040904), the National Key Technology R&D Program
(2012BAK02B00), and the program from Institute of Software,
Chinese Academy of Sciences (ISCAS2009-JQ03).
†
State Key Lab oratory of Computer Science, Institute of
Software, Chinese Academy of Sciences.
‡
Beijing Key Laboratory of Human-Computer Interaction,
Institute of Software, Chinese Academy of Sciences.
§
School of Engineering and Design, Brunel University.
¶
The corresponding author. Email: zhujq@ios.ac.cn
1 Introduction.
Document streams are generated in various forms
on the Internet, such as news streams, emails, micro-
blog articles, instant messages, res earch paper archives,
web forum discussion threads, and so fo rth. These doc-
ument streams generally concentrate on spec ific top-
ics. For example, people in the same socia l community
may talk about some common topics or discuss some
public or private events on the web. So far, most of
text mining research focused on finding topic s in docu-
ment strea ms . Topics can be extr acted from the str eam
involving both semantic a nd temporal information by
various topic modeling methods [5, 6, 18, 24]. Appar-
ently, there may be some correlations among these ob-
tained topics in successive documents for a specific user,
and these correlations could be described by Sequen-
tial Topic Patterns (STPs). Since capturing both
topic combinations and their orders, STPs serve well
as discriminative units of semantic association in am-
biguous situatio ns. Moreover, the abstract and proba-
bilistic description of topics can help to solve the cold
start problem and reach high confidence level in pattern
matching.
Some STPs occur frequently in a document stream
and thus reflect common behaviors of users. Besides,
there are still some others which are rare for the gene ral
population, but occur rela tively often for some specific
user or some specific group of users. Compared to
frequent ones, mining these user-related rare STPs
is more interesting. Theoretically, it defines a new kind
of patterns for event mining, which can characteriz e
those individual and personalized behaviors in a certain
context. Practically, it can be applied in many real-life
scenarios , as illustrated in the following two examples.
E
XAMPLE 1. Personalized context-aware rec-
ommendation. Traditional recommendation systems
have been extensively used to make recommendations
based on users’ history of preferences. However, in some
applications, they failed to consider users’ current situa-
tions and thus neglected the different preferences of users
in different contexts. For example, when a user visits a
web site, the context is reflected in the sequence of doc-
uments which the user has clicked and read in his/her
533
Copyright © SIAM.
Unauthorized reproduction of this article is prohibited.