An Approach for Discovering User Similarity in
Social Networks Based on the Bayesian Network and
MapReduce
Juan Xu
1
, Kun Yue (Corresponding author)
1
, Jin Li
2
, Feng Wang
3
, Weiyi Liu
1
1
School of Information Science and Engineering, Yunnan University, Kunming, China
kyue@ynu.edu.cn
2
School of Software, Yunnan University, Kunming, China
3
Yunnan Computer Technology Application Key Lab, Kunming University of Science and Technology, Kunming, China
Abstract—Adopting Bayesian network (BN) as the effective
framework for representing and inferring dependencies and
uncertainties among variables, in this paper, we established a
BN-based model to discover user similarities in social networks.
First, we built a BN to describe the direct similarity relationships
between users, called social user BN and abbreviated as SUBN.
Second, we proposed a distributed storage method based on
Hbase to store the SUBN and support the efficient probabilistic
inferences. Consequently, we proposed a SUBN-based method to
find indirect similarity relationships between users. Experimental
results show the efficiency and accuracy of our method.
Keywords—Social network, User similarity, Bayesian network
(BN), Hbase, MapReduce, Probabilistic inference
I. INTRODUCTION
With the rapid development of Web 2.0 applications and
social networks, social media have been regarded as highly
valuable as they can broaden our horizons in keeping track of
the life, in learning about societies, or simply, in making
advertising more profitable [1]. Recently, many researchers
have conducted various studies upon social networks, such as
community evolution [2] and product recommendation [3].
Among all of these social network studies, finding similar users
can help understand how user relationships will evolve, and
what paths should be taken to spread specific news/ads/
political views or which factors should be targeted for these
scenarios [4]. Generally, as an important kind of relationship
among social users, user similarities in social network reflect
user preference in social activities, and can be well used in
product recommendation, since similar users tend to share their
opinions together and the recommendations from similar users
are more willing to trust [3]. User similarities can be also used
to explore one user’s evaluation of another [5] and discover
user’s topic and role. Therefore, user similarity establishes the
basis for various perspectives of social network analysis.
Without loss of generality, user similarities in social
network are always implied and reflected from the activities of
their social interactions, which we call transactions analogous
to those in frequent pattern mining. For example, the
similarities among co-authors of academic papers are implied
and reflected from the DBLP records [6] (i.e., transactions).
Multiple users may relate to various products in the Epinions
social network [7], and multiple users may be connected to the
same video in the Youtube social network [8].
In recent years, many researchers proposed various
methods for discovering the similarities of social network users
from user interaction data by means of collaborative filtering,
link-based, and score-based ideas [9, 10]. However, when
confronted with massive transaction datasets, the accuracy and
scalability of the above methods cannot be guaranteed. To this
end, we consider the following two aspects. On one hand from
the inherence of similarity, uncertainties are doomed both in
representation and derivation of user similarities. On the other
hand from the rapid expansion of realistic social network
applications, large scale transactions of user interactions are
necessary to be retrieved efficiently. Therefore, it is natural to
consider discovering the similarities of social users from the
large scale transactions of historical activity interactions by
focusing on the uncertainty representation and derivation. This
is exactly the problem that we will address in this paper.
It is known that Bayesian network (BN) is the well-adopted
framework for uncertainty representation and inferences. A BN
is a directed acyclic graph (DAG), where nodes represent
random variables and edges represent dependencies among
these random variables. Each variable in a BN is associated
with a conditional probability table (CPT) to give the
probability of each parent state. By combining the graph and
probability theories, uncertainties can be represented directly
and inferred effectively. These mechanisms of uncertainty
representation and inferences make us use BN to describe and
discover the direct and indirect user similarities in this paper.
Meanwhile, MapReduce is a programming model for
processing and analyzing large data sets [11]. It not only offers
a parallel programming model, but also can process massive
data. Thus, we discuss the method for BN-based modeling and
induction of user similarities by adopting MapReduce as the
mechanism of massive transactions of social user behaviors.
Generally, the contributions of this paper are as follows:
• We propose a BN of users to represent the direct
similarity relationships between social network users,
called social user BN and abbreviated as SUBN. To
construct the SUBN from transaction datasets, we give
the MapReduce-based algorithm to obtain the DAG by