LDA topic model for microblog recommendation
Jianyong Duan, Yamin Ai
College of Computer Science
North China University of Technology
Beijing, China
Email: duanjy@hotmail.com
Xia Li
Key Laboratory of Language Engineering and Computing
Guangdong University of Foreign Studies
Guangzhou, China
Email: helly lx@126.com
Abstract—Microblogs are browser-based platforms on which web
users share information and communicate. With the rapid
growth of the microblog population, an effective
recommendation function has become necessary. This paper
proposes a recommendation method based on the Latent Dirichlet
Allocation (LDA) topic model, combined with user interest to
meet users' needs. It also conducts a comparative analysis
of indirect and direct recommendation algorithms. The
experimental results show that indirect recommendation
is more effective for microblog recommendation.
Keywords-Social media; recommendation system; LDA model
I. INTRODUCTION
Microblogs are a popular form of social media[1] and have been
accepted by the majority of internet users. Sina Microblog,
for example, has reached 129.1 million monthly active users
and 61.4 million daily active users in China. At the same
time, it has gradually accumulated abundant information. How
to recommend this information effectively is a crucial
problem[2], [3].
II. RELATED WORK
There has been some research on microblog
recommendation[4], [5], such as user-related
recommendation and tag-based recommendation. Several
difficulties remain. Firstly, most microblogs have no clear
topic[6], [7]; they often describe the users' own moods
or trivial, unrelated matters. Secondly, user interest
is always changing[8]. The microblog is a platform for
rapid information dissemination, and users easily switch
interests according to the information they browse. Thus
user behavior is difficult to capture[9]. Because the content
of a microblog post is limited, a user may stay only a few
seconds on one topic, which makes it difficult to capture
their preference for a certain topic[10]. Moreover, most
users rarely comment on topics, so the system cannot
effectively capture their interests.
In this paper, we introduce Latent Dirichlet Allocation
(LDA) for microblog topic model construction[11].
The model distributes microblog information across topics.
The recommendation system then accumulates the weights of
user interest over these topics and identifies the users'
interests.
III. USER TOPIC MODEL CONSTRUCTION
A. The LDA topic model with user interest combination
The LDA topic model is a Bayesian model[12] composed of
three levels: documents, topics and words. A document
consists of multiple topics, and a topic consists of
multiple words. The distribution of words in a document is
then represented as
$p(\mathrm{word} \mid \mathrm{document}) = \sum_{\mathrm{topic}} p(\mathrm{word} \mid \mathrm{topic}) \times p(\mathrm{topic} \mid \mathrm{document})$.
Assume that there are $m$ documents and $n$ distinct words in
the document set $D$. Then each topic (also called a theme) is
expressed as an $n$-dimensional vector $\varphi$, which follows
a Dirichlet distribution with parameter $\beta$.
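The mixture formula above can be illustrated with a small numeric sketch. The vocabulary, the two topics and all probability values below are made-up toy numbers, and the function name `word_distribution` is hypothetical; the snippet only shows how the sum over topics is evaluated.

```python
# Toy example (all values are illustrative assumptions, not data from the paper).
vocab = ["music", "football", "movie"]

# p(word | topic): one row per topic, each row sums to 1.
p_word_given_topic = [
    [0.7, 0.1, 0.2],  # topic 0
    [0.1, 0.8, 0.1],  # topic 1
]

# p(topic | document) for a single document; sums to 1.
p_topic_given_doc = [0.25, 0.75]


def word_distribution(p_topic_given_doc, p_word_given_topic):
    """Return p(word | document) = sum over topics of p(word | topic) * p(topic | document)."""
    n_words = len(p_word_given_topic[0])
    return [
        sum(p_t * row[j] for p_t, row in zip(p_topic_given_doc, p_word_given_topic))
        for j in range(n_words)
    ]


p_word_given_doc = word_distribution(p_topic_given_doc, p_word_given_topic)
# Approximately [0.25, 0.625, 0.125]; the result is again a probability distribution.
```

Because each row of `p_word_given_topic` and the vector `p_topic_given_doc` sum to 1, the resulting word distribution also sums to 1.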
In our LDA topic model, the word layer is
$W = \{w_1, w_2, w_3, \ldots, w_n\}$, the vocabulary that remains
after removing stop words. The topic layer is
$T = \{z_1, z_2, z_3, \ldots, z_t\}$; each topic $z_i$ is a
multinomial distribution over words,
$\varphi_i = (q_{i,1}, q_{i,2}, q_{i,3}, \ldots, q_{i,n})$ with
$\sum_{j=1}^{n} q_{i,j} = 1$, where $q_{i,j}$ represents the
probability of word $w_j$ in topic $z_i$. The document layer is
$D = \{\theta_1, \theta_2, \theta_3, \ldots, \theta_m\}$; each
document $d$ is a multinomial distribution over topics,
$\theta_d = (p_{d,1}, p_{d,2}, p_{d,3}, \ldots, p_{d,t})$ with
$\sum_{j=1}^{t} p_{d,j} = 1$, where $p_{d,j}$ represents the
probability of topic $z_j$ in document $d$.
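One way to read the three layers is generatively: to produce a word of a document, a topic is first drawn from the document's topic distribution, and a word is then drawn from that topic's word distribution. The sketch below assumes the two distributions are already known and omits the Dirichlet priors; the function name `sample_document` and the toy values are hypothetical.

```python
import random


def sample_document(theta_d, phi, vocab, length, rng):
    """Sample `length` words: draw a topic from theta_d, then a word from that topic."""
    topic_ids = list(range(len(theta_d)))
    words = []
    for _ in range(length):
        z = rng.choices(topic_ids, weights=theta_d, k=1)[0]  # topic for this word
        w = rng.choices(vocab, weights=phi[z], k=1)[0]       # word given the topic
        words.append(w)
    return words


# Toy values (assumptions for illustration only).
vocab = ["music", "football", "movie"]
phi = [
    [0.7, 0.1, 0.2],  # word distribution of topic z_1
    [0.1, 0.8, 0.1],  # word distribution of topic z_2
]
theta_d = [0.25, 0.75]  # topic distribution of one document

rng = random.Random(0)  # fixed seed so the sketch is repeatable
doc = sample_document(theta_d, phi, vocab, length=10, rng=rng)
```

Training an LDA model goes in the opposite direction: the words are observed, and the topic and document distributions are inferred from them.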
To introduce user interest conveniently[13], [14], we add
it to the LDA model as the set
$U = \{u_1, u_2, u_3, \ldots, u_y\}$. Each user is represented
by an accumulation of the document vectors $\theta$, with one
component per topic:
$u_i = \left(\sum_{d=1}^{S} p_{d,1}, \sum_{d=1}^{S} p_{d,2}, \ldots, \sum_{d=1}^{S} p_{d,t}\right)$,
where $S$ is the number of documents visited by the user and
$d$ ranges over those documents.
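The accumulation of user interest can be sketched as a component-wise sum of the topic distributions of the visited documents. The function name `user_interest` and the three toy documents are assumptions made for illustration.

```python
def user_interest(visited_thetas):
    """Sum the topic distributions of the documents a user has visited."""
    if not visited_thetas:
        return []
    n_topics = len(visited_thetas[0])
    return [sum(theta[j] for theta in visited_thetas) for j in range(n_topics)]


# Topic distributions of S = 3 visited documents over 3 topics (toy values).
visited = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]

u = user_interest(visited)  # approximately [1.4, 0.6, 1.0]

# Index of the topic with the largest accumulated weight.
top_topic = max(range(len(u)), key=u.__getitem__)
```

The accumulated vector is not normalized, so its components grow with the number of visited documents; dividing by their sum would give a distribution if one is needed.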
B. Clustering interest topics
To avoid repeated recommendations, we cluster similar
interest topics and treat each group as a single topic[15],
which improves recommendation diversity. The K-Means++
algorithm is used to cluster the topics[16]. It is an
unsupervised machine learning method and performs better
than the K-Means algorithm.
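K-Means++ differs from plain K-Means mainly in how the initial centroids are chosen: each new centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen. The following is a minimal sketch of that standard seeding idea, not necessarily the exact procedure used in this paper; the function names are hypothetical, and it assumes the input contains at least `k` distinct points.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(points, k, rng):
    """Choose k initial centroids with distance-squared weighted sampling.

    Assumes `points` contains at least k distinct vectors.
    """
    centroids = [rng.choice(points)]  # first centroid: uniform at random
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(squared_distance(p, c) for c in centroids) for p in points]
        # Far-away points are more likely to become the next centroid.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids


# Toy topic vectors (assumed values) forming two well-separated groups.
topics = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
initial = kmeanspp_init(topics, k=2, rng=random.Random(0))
```

A point that is already a centroid has weight zero, so it cannot be selected twice.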
Assume that the topic set is $T = \{z_1, z_2, z_3, \ldots, z_m\}$
and that the optimized set of $k$ initial centroids is
$P = \{p_1, p_2, p_3, \ldots, p_k\}$. Our clustering steps are
as follows:
(1) For each topic $z_i$, find the nearest centroid:
$tmp_i = \min_j \| z_i - p_j \|^2$;