An Improved Topic Detection Method for
Chinese Microblog Based On Incremental
Clustering
Gongshen Liu, Kui Meng, Jing Xie
School of Information Security, Shanghai Jiao Tong University, Shanghai, China
{lgshen, mengkui}@sjtu.edu.cn; xiejing1989@gmail.com
Abstract—A topic detection model based on hierarchical
clustering for Chinese microblog is proposed in this paper.
In order to minimize the impact of noise, we optimize the
feature selection and weight computation method and use a
new scoring method to filter out those topic-unrelated
tweets. We also give an improved topic detection algorithm
which uses a new vector distance calculation method and
center vector updating method. Experiments show that this
method filters out the majority of topic-unrelated tweets
and identifies microblog topics accurately and efficiently.
Microblog topic detection can help users and service
providers discover hot microblog topics dynamically.
Index Terms—Incremental clustering; Microblog; topic
detection
I. INTRODUCTION
In recent years, microblogging services have become
increasingly popular and are steadily moving into the
mainstream. Unlike traditional blogging services,
microblogging services are built on social networks. People
can share what they observe in their surroundings,
information about events, their opinions about certain
topics, and even updates on their whereabouts through
microblogging. Moreover, one can follow other
microbloggers to have their updates delivered in real
time. Microblogging also provides other functions
such as retweeting (reposting) and commenting. People can
retweet a microblog with the “//@username:” format. The
“#hashtag#” format indicates that the message is related to a
particular topic. In addition, microblogs can be written
or received with a variety of computing devices, including
cell phones. This has empowered people to act as
sensors or sources of data that could lead to important
pieces of information. Moreover, various metadata can be
extracted from the posts, such as location, time, and name.
Aggregate analysis of these data spans dimensions such as
space, time, theme, sentiment, and network structure,
and gives researchers an opportunity to
understand social perceptions of people in the context of
certain events of interest.
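As an illustration of the “//@username:” and “#hashtag#” conventions described above, the following is a minimal sketch of how a post might be split into the poster's own comment, its topic tags, and its repost chain. The function name parse_post, the regular expressions, and the sample text are assumptions made for illustration and are not part of the method proposed in this paper.

```python
import re

# Topic tags are wrapped in a pair of '#' characters, e.g. "#World Cup#".
HASHTAG_RE = re.compile(r"#([^#\s][^#]*?)#")
# Each reposted layer starts with "//@username:" (ASCII or full-width colon).
RETWEET_RE = re.compile(r"//@([^:：\s]+)[:：]")


def parse_post(text):
    """Split a post into own comment, hashtags, and reposted user names.

    Hypothetical helper for illustration only.
    """
    hashtags = HASHTAG_RE.findall(text)
    retweeted_users = RETWEET_RE.findall(text)
    # Everything before the first "//@" was written by the reposting user.
    own_comment = text.split("//@", 1)[0].strip()
    return {
        "own_comment": own_comment,
        "hashtags": hashtags,
        "retweeted_users": retweeted_users,
    }
```

For a post such as "Agreed! //@alice: worth reading //@bob: #World Cup# kicks off tonight", this sketch would report the own comment "Agreed!", the tag "World Cup", and the repost chain alice, bob. Real posts may contain nested or malformed markers that these simple patterns do not handle.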
The goal of topic detection is to group large
numbers of tweets according to their topic. Microblog topic
detection differs from traditional topic detection in three
aspects: first, microblogs or tweets are brief (typically
140 – 200 characters); second, the number of tweet topics
grows quickly; third, tweets contain a large amount of
topic-unrelated noise.
Our research focuses on finding hot tweet topics,
clustering related tweets, and extracting tweet topic keywords.
In this paper, we study data from Sina Weibo (one of the
most visited microblogging websites in China) and
propose a topic detection method based on hierarchical
clustering for Chinese microblogs. Microblog topic
detection can help users find hot tweet topics more
effectively and help providers improve their
microblogging services.
II. RELATED WORK
[1] proposes an algorithm for internet public opinion
hotspot detection and analysis based on K-means and
SVM. The authors use the traditional vector space model for
text representation, then apply K-means clustering and
SVM classifiers to the documents to detect internet public
opinion hotspots and classify subsequent texts into the
corresponding classes. However, K-means is sensitive to
noise, and microblogs contain many topic-unrelated
tweets. This algorithm cannot reduce the influence of such
noise. Moreover, the algorithm was designed for traditional
websites, so it is not well suited to microblogs. [2] studies
characteristics of breaking news on Twitter and proposes a
method to collect, group, rank and track breaking news on
Twitter. The authors index each tweet and group similar
tweets together. They also propose a measure to
score each group and rank the groups according to the
score. [3] proposes a method for detecting sudden topics
on microblogs based on a dynamic sliding window. The
authors use windows to extract the information with
potential sudden features, compute feature weights, and
build a VSM with a TF-IDF function combined with
semantics. Then, they use an improved Single-Pass
clustering algorithm to generate the final clustering. This
method is simple and accurate, but its miss rate is quite
high. Furthermore, this method only focuses on finding
sudden topics. [4] proposes an approach for mining news
topics from microblogs. The author uses the word
frequency and growth rate in the time window to
generate a compound weight and extract news keywords,
and then clusters the keywords and detects news topics with
an incremental clustering method. However, the experimental
results show that this method cannot achieve high precision
doi:10.4304/jsw.8.9.2313-2320