Topic Detection for Chinese Microblogs: An Improved Incremental Clustering Method

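"This paper proposes an improved topic detection method for Chinese microblogs based on hierarchical clustering. It aims to reduce the impact of noise by optimizing feature selection and weight computation, and it uses a new scoring method to filter out topic-unrelated tweets. Experiments show that the method filters out most topic-unrelated tweets and identifies microblog topics accurately and efficiently. This matters for users and service providers who want to discover hot microblog topics dynamically."

The paper is mainly concerned with improving the accuracy and efficiency of topic detection on Chinese microblogging platforms such as Weibo. With the spread of social media, microblogs have become an important platform for information dissemination and public discussion. However, the massive real-time stream of posts and the large amount of noisy data make topic detection a challenge.

First, the paper introduces a topic detection model based on hierarchical clustering. Hierarchical clustering is a common data analysis technique that organizes data into a tree of nested clusters, so that items grouped in the same cluster are highly similar. In microblog topic detection, this helps place related tweets under the same topic.

To reduce the influence of noisy data, the authors optimize feature selection and weight computation. Feature selection is a key step in processing text data: it decides which words or phrases are most valuable for identifying a topic, and optimizing it reduces the interference of irrelevant information. Weight computation concerns how to assign a weight to each feature based on factors such as term frequency and term importance, so that the weight better reflects the feature's role in the topic.

Next, the paper proposes a new scoring method for filtering out tweets that are unrelated to the topic. This scoring method may be based on word co-occurrence, sentiment analysis, or other relevance measures, so that only tweets closely related to the current topic are kept.

The paper also introduces an improved vector distance calculation method and an improved center vector updating algorithm. The vector distance measures the similarity between two tweets, while the center vector update adjusts the representative of each cluster during clustering. Together, these improvements raise the accuracy and stability of the clustering.

The experimental results show that the improved method filters out most topic-unrelated tweets and identifies microblog topics accurately and efficiently. For users, this means finding hot topics of interest more quickly; for service providers, it helps with real-time monitoring and analysis of public opinion, and therefore with more precise services and recommendations.

Finally, the paper notes that research on microblog topic detection has theoretical and practical value for understanding social media trends, public opinion analysis, and information diffusion research. Future work may further explore how to adapt to a dynamically changing network environment and improve the real-time performance and robustness of the algorithm.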
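The clustering loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the excerpt does not give the paper's distance formula, center update rule, or feature weights, so the sketch substitutes plain term counts, cosine similarity, and a running-mean center update. The function names, the `threshold` value, and the pre-segmented token lists are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def incremental_cluster(docs, threshold=0.3):
    """Single-pass clustering over already-segmented posts.

    Each post joins the most similar existing cluster if the similarity
    reaches `threshold` (an assumed value); otherwise it starts a new cluster.
    Returns a list of {"center": dict, "members": [post indices]}.
    """
    clusters = []
    for i, tokens in enumerate(docs):
        # Assumption: raw term counts stand in for the paper's feature weights.
        vec = {t: float(c) for t, c in Counter(tokens).items()}
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["center"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            # Assumption: the center is the running mean of member vectors.
            n = len(best["members"])
            center = best["center"]
            for t in set(center) | set(vec):
                center[t] = (center.get(t, 0.0) * n + vec.get(t, 0.0)) / (n + 1)
            best["members"].append(i)
        else:
            clusters.append({"center": vec, "members": [i]})
    return clusters


posts = [
    ["earthquake", "sichuan", "rescue"],
    ["earthquake", "rescue", "team"],
    ["football", "match", "goal"],
    ["football", "goal", "win"],
]
clusters = incremental_cluster(posts, threshold=0.3)
print([c["members"] for c in clusters])
```

Because each post is compared only against the current cluster centers, new posts can be absorbed as they arrive without re-clustering the earlier ones, which is the property that makes this style of clustering incremental.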

An Improved Topic Detection Method for Chinese Microblog Based On Incremental Clustering

Gongshen Liu, Kui Meng, Jing Xie

School of Information Security, Shanghai Jiao Tong University, Shanghai, China

{lgshen, mengkui}@sjtu.edu.cn; xiejing1989@gmail.com

Abstract—A topic detection model based on hierarchical clustering for Chinese microblogs is proposed in this paper. To minimize the impact of noise, we optimize the feature selection and weight computation methods and use a new scoring method to filter out topic-unrelated tweets. We also give an improved topic detection algorithm that uses a new vector distance calculation method and a new center vector updating method. Experiments show that this method can filter out the majority of topic-unrelated tweets and identify microblog topics accurately and efficiently. The study of microblog topic detection methods can help users and service providers discover hot microblog topics dynamically.

Index Terms—Incremental clustering; microblog; topic detection

I. INTRODUCTION

In recent years, microblogging services have become increasingly popular and are gradually moving into the mainstream. Unlike traditional blogging services, microblogging services are built on social networks. People can use microblogs to share what they observe in their surroundings, information about events, their opinions on certain topics, and even updates on their whereabouts. Moreover, one can follow other microbloggers to have their updates delivered in real time. Microblogging also provides many other functions, such as retweeting (reposting) and commenting. People can retweet a microblog with the “//@username:” format. The “#hashtag#” format indicates that the message is related to a particular topic. In addition, microblogs can be written or received on a variety of computing devices, including cell phones. This has empowered people to act as sensors or sources of data, which can lead to important pieces of information. Moreover, various metadata can be extracted from the posts, such as location, time, and name. Aggregate analysis of these data covers different dimensions, such as space, time, theme, sentiment, and network structure, and gives researchers an opportunity to understand people's social perceptions in the context of certain events of interest.
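As a small illustration of the “//@username:” and “#hashtag#” conventions mentioned above, the following sketch pulls the topic tags and the retweet chain out of a post. It is a minimal example under assumptions rather than part of the paper: the function name `parse_post`, the regular expressions, and the sample text are hypothetical, and the patterns assume that a hashtag contains no whitespace and that a username ends at an ASCII or full-width colon.

```python
import re

# Assumed patterns: "#topic#" with no whitespace inside, and "//@name:"
# terminated by either an ASCII or a full-width colon.
HASHTAG_RE = re.compile(r"#([^#\s]+)#")
RETWEET_RE = re.compile(r"//@([^:：\s]+)[:：]")


def parse_post(text):
    """Split a post into its own text, its hashtags, and its retweet chain."""
    first_retweet = RETWEET_RE.search(text)
    # Everything before the first "//@name:" marker is the poster's own comment.
    own_text = text[: first_retweet.start()] if first_retweet else text
    return {
        "own_text": own_text.strip(),
        "hashtags": HASHTAG_RE.findall(text),
        "retweeted_from": RETWEET_RE.findall(text),
    }


post = "#earthquake# rescue teams are on the way //@alice: stay safe //@bob: original post"
print(parse_post(post))
```

Separating the poster's own comment from the retweeted chain in this way is one option for a preprocessing step, since the repeated text in a retweet chain would otherwise be counted once per repost.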

The goal of topic detection is to classify a large number of tweets according to their topics. Microblog topic detection differs from traditional topic detection in three respects: first, microblogs or tweets are brief (typically 140 to 200 characters); second, the number of tweet topics grows quickly; third, tweets contain a large amount of topic noise.

Our research focuses on finding hot tweet topics, clustering related tweets, and extracting tweet topic keywords. In this paper, we study data from Sina Weibo (one of the most visited microblogging websites in China) and propose a topic detection method based on hierarchical clustering for Chinese microblogs. Microblog topic detection can help users find hot tweet topics more effectively and help providers improve their microblogging services.

II. RELATED WORK

[1] proposes an algorithm for internet public opinion hotspot detection and analysis based on K-means and SVM. The authors use the traditional vector space model for text representation, then apply K-means clustering and SVM classifiers to the documents to detect internet public opinion hotspots and classify subsequent texts into the corresponding classes. However, K-means is sensitive to noise, and microblogs contain many topic-unrelated tweets. This algorithm cannot reduce the influence of such noise. Moreover, the algorithm was designed for traditional websites, so it is not well suited to microblogs. [2] studies the characteristics of breaking news on Twitter and proposes a method to collect, group, rank, and track breaking news on Twitter. The authors index each tweet and group similar tweets together. They also propose a measure to score each group and rank the groups according to the score. [3] proposes a method for detecting sudden topics on microblogs based on a dynamic sliding window. The authors use windows to extract information with potential sudden features, compute feature weights, and build a VSM with a TF-IDF function combined with semantic information. Then, they use an improved Single-Pass clustering algorithm to generate the final clustering. This method is simple and accurate, but its miss rate is quite high. Furthermore, this method focuses only on finding sudden topics. [4] proposes an approach for mining news topics from microblogs. The author uses word frequency and growth rate within the time window to generate a compound weight and extract news keywords, then clusters the keywords and detects news topics with an incremental clustering method. But the experimental result shows that this method cannot get high precision

doi:10.4304/jsw.8.9.2313-2320
