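Mining Rare Sequential Topic Patterns in Internet Document Streams: A Novel Strategy for Personalized Recommendation and Behavior Monitoring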


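Text documents on the Internet keep growing in volume, and mining their topic patterns matters in many domains. Existing research, however, concentrates on topic modeling and pays little attention to the sequential patterns of topics in document streams. This paper focuses on rare Sequential Topic Patterns (STPs): patterns that are rare across the whole document stream but occur relatively often for specific users or user groups, which makes them practically valuable. Traditional sequential pattern mining algorithms were designed for frequent patterns in deterministic data sets, so they do not handle the topic uncertainty and rare patterns found in document streams. Because such STPs reflect personalized behavior, for example browsing habits and shifts in interest, they have potential applications in personalized recommendation and abnormal behavior detection.

The authors propose the following approach. Documents are first preprocessed: topics are extracted with Latent Dirichlet Allocation (LDA), and the document stream is divided into sessions for different users over different time periods. An efficient pattern-growth algorithm then mines STP candidates for each user. Finally, pattern rarity analysis selects the user-related rare STPs from those candidates, using the temporal and probabilistic information of the topics involved to separate patterns that are both globally rare and closely tied to a user's behavior.

Experiments on synthetic and real data sets indicate that the approach discovers interesting rare STPs effectively and efficiently. The results are relevant to context-aware personalized recommendation and to real-time monitoring of abnormal user behavior.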
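The session-splitting step mentioned in the summary can be pictured with a minimal sketch. It assumes that a session ends whenever a user is inactive for longer than a fixed gap; the summary does not say how the paper delimits time periods, so `max_gap`, the tuple layout, and the function name are all illustrative choices rather than the authors' definitions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (user_id, timestamp_in_seconds, doc_id); this layout is an assumption.
Doc = Tuple[str, float, str]


def split_into_sessions(stream: List[Doc], max_gap: float) -> Dict[str, List[List[str]]]:
    """Group documents by user, then cut each user's timeline wherever two
    consecutive documents are more than `max_gap` seconds apart."""
    by_user: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for user, ts, doc_id in stream:
        by_user[user].append((ts, doc_id))

    sessions: Dict[str, List[List[str]]] = {}
    for user, docs in by_user.items():
        docs.sort()  # chronological order within one user
        user_sessions: List[List[str]] = []
        last_ts = None
        for ts, doc_id in docs:
            if last_ts is None or ts - last_ts > max_gap:
                user_sessions.append([])  # inactivity gap: start a new session
            user_sessions[-1].append(doc_id)
            last_ts = ts
        sessions[user] = user_sessions
    return sessions


stream = [
    ("u1", 0.0, "d1"), ("u2", 5.0, "d2"), ("u1", 30.0, "d3"),
    ("u1", 4000.0, "d4"), ("u2", 20.0, "d5"),
]
result = split_into_sessions(stream, max_gap=1800.0)
print(result)
```

With a 30-minute gap, user `u1` ends up with two sessions and `u2` with one. A fixed gap is only one possible rule; calendar windows or per-user thresholds would fit the same interface.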

Discovery of Rare Sequential Topic Patterns in Document Stream∗

Zhongyi Hu‡, Hongan Wang†‡, Jiaqi Zhu‡¶, Maozhen Li§, Ying Qiao‡, Changzhi Deng‡

Abstract

Plain text documents created and distributed on the Internet are ever changing in various forms. Mining topics of these documents has significant applications in many domains. Most of the literature is devoted to topic modeling, while sequential patterns of topics in document streams are ignored. Moreover, traditional sequential pattern mining algorithms mainly focused on frequent patterns for deterministic data sets, and are thus not suitable for document streams with topic uncertainty and rare patterns. In this paper, we formulate and handle the mining problem of rare Sequential Topic Patterns (STPs) for Internet document streams, which are rare on the whole but relatively frequent for specific users, so also interesting. Since this type of rare STPs reflects users’ specific behaviors, our work can be applied in many fields, such as personalized context-aware recommendation and real-time monitoring on abnormal user behaviors on the Internet. We propose a novel approach to discovering user-related rare STPs based on the temporal and probabilistic information of concerned topics. After extracting topics from documents by LDA and sorting the document stream into sessions for different users during different time periods, the proposed algorithms discover rare STPs by (1) mining STP candidates for each user through an efficient algorithm based on pattern-growth, and (2) generating user-related rare STPs by pattern rarity analysis. Experiments on both synthetic and real data sets show that our approach can discover interesting rare STPs very effectively and efficiently.
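To make the phrase "pattern-growth" concrete, here is a small PrefixSpan-style sketch. It is a stand-in rather than the paper's algorithm: it assumes each document has already been reduced to a single, certain topic id, so it ignores the topic uncertainty the abstract emphasizes, and the names `pattern_growth` and `min_support` are illustrative.

```python
from typing import Dict, List, Tuple

Sequence = List[int]  # a session reduced to one topic id per document (assumption)


def pattern_growth(sequences: List[Sequence], min_support: int) -> Dict[Tuple[int, ...], int]:
    """Return every sequential pattern contained (as a subsequence) in at
    least `min_support` sequences, mapped to its support count."""
    results: Dict[Tuple[int, ...], int] = {}

    def grow(prefix: Tuple[int, ...], projected: List[Sequence]) -> None:
        # Count, for each topic, how many projected suffixes contain it.
        counts: Dict[int, int] = {}
        for suffix in projected:
            for topic in set(suffix):
                counts[topic] = counts.get(topic, 0) + 1
        for topic, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + (topic,)
            results[pattern] = support
            # Project: keep only what follows the first occurrence of `topic`.
            next_db = [s[s.index(topic) + 1:] for s in projected if topic in s]
            grow(pattern, next_db)

    grow((), sequences)
    return results


sessions = [[1, 2, 3], [1, 3], [2, 1, 3]]
patterns = pattern_growth(sessions, min_support=2)
print(patterns)
```

The recursion extends a prefix one topic at a time and only scans the suffixes that can still contain an extension, which is the core idea behind pattern-growth miners. Handling probabilistic topics would require replacing the integer counts with expected supports.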

∗ This work is supported by the National Key Basic Research Program of China (973 Program, No.2013CB329305), the State Key Program of National Natural Science Foundation of China (NSFC-61232013), National Natural Science Foundation of China (NSFC-61202217), the National High Technology Research and Development Program of China (863 Program, No.2012AA040904), the National Key Technology R&D Program (2012BAK02B00), and the program from Institute of Software, Chinese Academy of Sciences (ISCAS2009-JQ03).

† State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences.

‡ Beijing Key Laboratory of Human-Computer Interaction, Institute of Software, Chinese Academy of Sciences.

§ School of Engineering and Design, Brunel University.

¶ The corresponding author. Email: zhujq@ios.ac.cn

1 Introduction.

Document streams are generated in various forms on the Internet, such as news streams, emails, micro-blog articles, instant messages, research paper archives, web forum discussion threads, and so forth. These document streams generally concentrate on specific topics. For example, people in the same social community may talk about some common topics or discuss some public or private events on the web. So far, most text mining research has focused on finding topics in document streams. Topics can be extracted from the stream involving both semantic and temporal information by various topic modeling methods [5, 6, 18, 24]. Apparently, there may be some correlations among these obtained topics in successive documents for a specific user, and these correlations could be described by Sequential Topic Patterns (STPs). Because they capture both topic combinations and their orders, STPs serve well as discriminative units of semantic association in ambiguous situations. Moreover, the abstract and probabilistic description of topics can help to solve the cold start problem and reach a high confidence level in pattern matching.
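Because topics are described probabilistically, whether an STP occurs in a session is itself a probability rather than a yes/no fact. The sketch below shows one way to compute it, under two assumptions that are not taken from the paper: each document carries a per-topic probability of exhibiting that topic, and documents are independent of one another. The function name and data layout are illustrative.

```python
from typing import Dict, List, Sequence

TopicDist = Dict[int, float]  # topic id -> probability for one document (assumption)


def occurrence_probability(session: List[TopicDist], pattern: Sequence[int]) -> float:
    """Probability that `pattern` appears as an ordered subsequence of topics
    in `session`, assuming documents are independent of each other."""
    k = len(pattern)
    # state[j] = probability that exactly the first j pattern topics have
    # been matched so far by greedy left-to-right matching.
    state = [1.0] + [0.0] * k
    for dist in session:
        nxt = [0.0] * (k + 1)
        nxt[k] = state[k]  # a complete match stays complete
        for j in range(k):
            p = dist.get(pattern[j], 0.0)  # chance this document supplies the next topic
            nxt[j + 1] += state[j] * p
            nxt[j] += state[j] * (1.0 - p)
        state = nxt
    return state[k]


session = [{1: 0.5, 2: 0.5}, {1: 0.5, 2: 0.5}]
prob = occurrence_probability(session, [1, 2])
print(prob)
```

Greedy matching is enough here because a pattern occurs as a subsequence exactly when its leftmost match exists, so the dynamic program needs only one state per matched prefix length and runs in time proportional to session length times pattern length.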

Some STPs occur frequently in a document stream and thus reflect common behaviors of users. Besides, there are still some others which are rare for the general population, but occur relatively often for some specific user or some specific group of users. Mining these user-related rare STPs is more interesting than mining frequent ones. Theoretically, it defines a new kind of pattern for event mining, which can characterize individual and personalized behaviors in a certain context. Practically, it can be applied in many real-life scenarios, as illustrated in the following two examples.
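Before those examples, the contrast between "rare for the general population" and "relatively often for a specific user" can be made concrete with a small sketch. The two thresholds and the use of a plain average as the global support are assumptions made here for illustration; the text above does not give the paper's actual rarity measure.

```python
from typing import Dict, List, Tuple

Pattern = Tuple[int, ...]


def user_related_rare(
    support: Dict[str, Dict[Pattern, float]],
    min_user_support: float,
    max_global_support: float,
) -> Dict[str, List[Pattern]]:
    """Keep, per user, the patterns that the user exhibits often enough
    while the population as a whole exhibits them rarely."""
    users = list(support)
    all_patterns = {p for per_user in support.values() for p in per_user}
    # Global support: mean of per-user supports (missing entries count as 0).
    global_support = {
        p: sum(support[u].get(p, 0.0) for u in users) / len(users)
        for p in all_patterns
    }
    rare: Dict[str, List[Pattern]] = {}
    for u in users:
        hits = sorted(
            p for p, s in support[u].items()
            if s >= min_user_support and global_support[p] <= max_global_support
        )
        if hits:
            rare[u] = hits
    return rare


support = {
    "u1": {(1, 2): 0.9, (3, 4): 0.8},
    "u2": {(1, 2): 0.8},
    "u3": {(1, 2): 0.7},
    "u4": {(1, 2): 0.8},
}
rare = user_related_rare(support, min_user_support=0.5, max_global_support=0.25)
print(rare)
```

In this toy data the pattern `(1, 2)` is common to everyone and is filtered out, while `(3, 4)` is frequent only for `u1` and is kept as a user-related rare pattern.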

Example 1. Personalized context-aware recommendation. Traditional recommendation systems have been extensively used to make recommendations based on users’ history of preferences. However, in some applications, they failed to consider users’ current situations and thus neglected the different preferences of users in different contexts. For example, when a user visits a web site, the context is reflected in the sequence of documents which the user has clicked and read in his/her


Unauthorized reproduction of this article is prohibited.
