SGP: Sampling Big Social Network Based on Graph Partition
Xiaolin Du, Yunming Ye
Key Laboratory of Internet Information Collaboration,
Shenzhen Graduate School, Harbin Institute of
Technology, China
duxiaolinhitsz@gmail.com
yeyunming@hit.edu.cn
Yan Li, Yueping Li
School of Computer Engineering,
Shenzhen Polytechnic,
Shenzhen, China
liyan@szpt.edu.cn
leeyueping@gmail.com
Abstract—Deriving a representative sample from a big social
network is essential for many Internet services that rely on
accurate analysis of big social data. A good sampling method
for social network should be able to generate small sample
networks with similar structures as original big network. In
this paper, we propose SGP, a new big social network
sampling algorithm based on graph partition. In SGP,
original network is firstly partitioned into several sub-
networks that will be sampled evenly. This procedure
enables SGP to effectively maintain the topological similarity
and community structure similarity between the sampled
network and its original network. We have evaluated SGP
on several well-known data sets. The experimental results
show that SGP outperforms six state-of-the-art methods.
Keywords-sampling algorithms; social networks; graph
partition; community structure; topology structure
I. INTRODUCTION
In the era of Social Web, social networks (twitter,
micro-blog, MSN, Facebook, co-citation relation, credit
network) appear everywhere. The last few years have
witnessed an explosive growth of online social networks
which have attracted most attention from all over the
world [1]. The rapid growth of social networks has
brought new challenges to the research on social networks.
The modern science of networks has brought significant
advances in our understanding of complex systems [2]. In
research, social networks are usually represented by
different types of graphs. Vertices represent entities, and
edges represent interactions between pairs of entities.
Some graph mining techniques (graph visualization
techniques, graph structure analyzing techniques, etc.) are
then employed to assist big social networks analysis.
However, for a large-scale graph with millions of vertices,
it is very difficult to use graph mining approaches to
handle the entire graph directly. Finding specific methods
to accelerate the large-scale graph mining process is an
essential issue. One popular solution is to accomplish a
sub-graph, which can represent the original graph
effectively so that we are able to use this sub-graph for
simulations and analysis. The accomplishment of a sub-
graph relies on a graph sampling process. This sampling
process aims at selecting a set of vertices and edges in a
way that the resulting sub-graph obeys some general
characteristics of the original graph. In this paper, we
focus on developing a large-scale graph sampling
technique.
Generally, sampling large graph encounters three
questions [3]. What is good sampling method? What is a
good sample size? How do we measure the goodness of a
single sample as well as the goodness of a whole sampling
method? Many researchers have proposed their solutions
to sample social networks. There are some state of the art
sampling algorithms: Random Node (RN) sampling,
Random PageRank Node (RPN) sampling, Random
Degree Node (RDN) sampling, Random Edge (RE)
sampling, Random Walk (RW) sampling, Random Jump
sampling(RJ), Forest Fire (FF) sampling, Breadth-First
Sampling (BFS) and other sampling strategies, which will
be introduced briefly in section II. For these algorithms,
sampling size is preset by users so that users can get their
ideal sampled graph. In the specific sampling process,
maintaining similar properties between the sampled graph
and the original graph is significant. Only the sampled
graph represents the original graph well, can we study the
sampled graph instead of the original graph. How to
evaluate whether the sampled graph and original graph
have similar properties? Now there are some techniques to
measure the similarity which will be introduced in section
II.
In this paper, we have proposed a new big social
network sampling algorithm based on graph partition
(SGP). The proposed SGP algorithm firstly partitions the
original network into several sub-networks, and then
stratifies samples vertices in each sub-network. Thus, this
procedure enables SGP to effectively maintain the
topological similarity and community structure similarity
between the sampled network and its original network.
The rest of the paper is organized as follows: Section II
presents the related works. Section III describes the
proposed SGP algorithm based on graph partition. The
experiment process and the experimental results will be
presented in Section IV. Finally, Section V concludes the
paper.
II. RELATED WORKS
In this section, we will introduce some network
sampling algorithms and performance evaluations
respectively.
A. Graph Sampling Algorithms
Currently, there have been several state-of-the-art
graph sampling algorithms. Conceptually, we can split
these existing algorithms into three groups [3]: methods
based on randomly selecting vertices, methods relying on
randomly selecting edges, and exploration techniques that
simulate random walks or virus propagation to find a
representative sample of the vertices.