没有合适的资源?快使用搜索试试~ 我知道了~
首页2014年C题outstanding奖
资源详情
资源推荐
Who Is the Hidden Champion in a Network?
February 10, 2014
1 Summary
Network science is essential in many civil and military applications due to its great help
to complex systems with network-based structures. Since the co-author relationship as well as
the citation relationship forms a network structure in research areas, network science can be
utilized to do some data-mining works in this network, e.g. seeking influential researchers and
important research papers.
In this paper, we extract useful data from given large datasets and establish the required co-
author and citation networks. For visual clarity of the network, we propose and implement the
stress majorization algorithm based on the minimal-energy principle in physics, which provides
a proper 2-D layout for complex networks. Then we develop four different centrality measures
based on the perspectives of trivial sum, weighted sum, geographic control and informational
control, respectively, for undirected graph and three centrality measures for directed graph, in
which both the non-parametric and parametric methodologies are involved. Furthermore, we
propose several methods to transform a directed graph into an undirected one. We show that
these measures generate consistent results for both the most influential researcher and the most
important paper.
Then we extend our model from research areas to other areas of society, and show that our
model still yields convincing results for top singers, songwriters and movie stars, even when a
bipartite graph is used instead. Moreover, we conclude that a crude criterion for a real problem
to be solved by network science in essence is that there is a network structure and something
in interest transmitting through the network, with all real experiments conforming to this s-
tatement. In addition, it is proved theoretically and verified numerically that our parametric
approach guarantees its convergence, hence there are only loose requirements for parameters.
In summary, our model is practical and reliable for handling network-based problems in real-
ity. Nevertheless, there are some existing problems such as computational complexity and lack
of automatic adaptation preventing our model from convenient implementation, and certain
phenomenon such as information cascade lies beyond the scope of our modelling.
Key Words: network science, co-author, citation, centrality, stress majorization
1
Team # 30221 Page 2 of 20
Contents
1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3 Task 1: Co-author Network of the Erdos1 Author . . . . . . . . . . . . . . . . . . . . . . 3
4 Task 2: Centrality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.1 Degree Centrality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.2 Eigenvector Centrality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4.3 Closeness Centrality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.4 Betweenness Centrality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.5 Summary of Centrality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5 Task 3: The Citation Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5.1 Undirected Graph Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
5.2 Directed Graph Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
5.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
6 Task 4: Applications in Singers and Movie Actors . . . . . . . . . . . . . . . . . . . . . 13
6.1 Basic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
6.2 Bipartite Graph Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
6.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
7 Task 5: Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
8 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
8.1 Non-parametric Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
8.2 Parametric Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
9 Strengths and Weaknesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
9.1 Strengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
9.2 Weaknesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Team # 30221 Page 3 of 20
2 Introduction
Network science has gained its popularity recently due to the considerable network struc-
tures emerging in reality, which can be of great help to data mining, dynamic systems, etc. In
academic fields, we can establish the corresponding network structures for both citation and co-
authoring relationships, and then the data mining work within these networks is of great interest
and significance. Some simple questions are raised: how to measure the influence of a researcher, or
how to evaluate the importance of a research paper?
To answer this question, there are network-based evaluation tools that use co-author and
citation data to determine the impact factor of researchers, publications and journals, such as
Science Citation Index (SCI), H-factor, Impact factor, Eigenfactor, etc. Our goal is to design
effective measures to analyze influence in research networks, and then extend it to other areas
of society. Specifically, we will do the following things in this paper:
• Establish a co-author network using given data and plot it on a 2-D plane, then propose
several measures to determine who is the most influential researcher.
• Propose models to evaluate the importance of research papers using co-author and citation
data.
• Extend the models to other areas in society or other entities in research areas, then test and
analyze its performances.
• Gain heuristics from this model and discuss how individuals can learn from it in reality.
• Implement the sensitivity test and analyze its strengths and weaknesses, as well as the
potential research directions.
3 Task 1: Co-author Network of the Erdos1 Author
First of all, we build the co-author network of the Erdos1 authors from the source provid-
ed by the problem. Due to the symmetry of the pairwise co-author relationship, it is natural
to utilize graph theory to fully characterize it, where we establish a simple undirected graph and
regard researchers as vertices and the co-authoring relationship as edges. For computational sim-
plicity, we use an adjacency matrix A = (a
ij
)
N×N
to store this relationship, where N = 511 is the
number of researchers who have co-authored with Erd
¨
os, and the entry a
ij
is an indicator of the
relationship:
a
ij
=
(
1 researcher i has co-authored with researcher j
0 elsewhere
(1)
In particular, since this graph is undirected, A is symmetric.
Observing that the dataset is large (with 511 researchers and over 18,000 raw lines), we use
programming to accomplish data extraction. Specifically, we just write a simple string matching
algorithm by Ruby codes, which exports the corresponding adjacency matrix (excluding Erd
¨
os)
for further use
1
. Once we obtain the adjacency matrix A, some of its properties such as transitiv-
ity ratio
2
are listed as follows.
1
Note that the resulting matrix is not symmetric due to some minor errors in source data, and we force its symmetry
by assigning a
ij
← max(a
ij
, a
ji
).
2
The transitivity ratio describes how vertices tend to cluster and is defined as
C ,
3 × number of triangles
number of connected triples of vertices
=
trace(A
3
)
sum(A
2
− A)
Team # 30221 Page 4 of 20
Table 1: Basic Properties for Co-authoring Relationship
Number of vertices Number of edges Average degree Transitivity ratio
511 1713 6.70 0.2238
Since the average degree 6.70 511, we observe that this graph is sparse. Moreover, accord-
ing to the transitivity ratio, this co-authoring relationship is moderately transitive. These results
meet our intuition that researchers often co-authored with those who they are familiar with, but
there are also some restrictions on pairwise cooperation, e.g. time limitation. To see deeper into
this graph, we summarize its degree distribution in Table 2.
Table 2: Degree Distribution
Range of degree 0 [1,2] [3,5] [6,10] [11,20] [21,30] [31,d
max
=52]
Number of vertices 37 150 122 114 51 23 14
As is illustrated in the table, majority of vertices have small degrees, while a fraction (16%)
of researchers are somewhat well-known who have co-authored with many researchers (more
than 10) in this list. This fact could also be illustrated by the network plotted on a 2-D plane,
while it is a challenging task owing to the obvious difficulties in designing a proper and visible
layout. It is quite complicated to define visible here, but intuitively, it is better to place two linked
vertices to be close, and place the distance of those which there is no edge connecting them to
be relatively far.
Using the heuristics above, Gansner et al.[1] developed a graph-drawing method called stress
majorization. Assume that there exists an optimal distance between each pair (this distance is set
according to certain criteria, such as the one discussed above), then we can approach this optimal
state by setting the vertices properly. We consider a metric to measure the so-called approaching
as follows.
stress(X) ,
X
i<j
w
ij
(kX
i
− X
j
k − d
ij
)
2
(2)
where X
i
∈ R
2
is the 2-D position for vertex i, d
ij
, w
ij
’s are optimal distances and corresponding
weights, respectively, and k · k is the Euclidean distance. [1] suggests:
• d
ij
is the overall length of the shortest path connecting i and j (is computed using Floyd
Shortest-Path Algorithm[2]), with each edge (i, j) assigning a length
l
ij
= #{k : exactly one of (k, i) and (k, j) is an edge} (3)
• w
ij
= d
−2
ij
is the corresponding weight.
We can imagine that there is a mechanic spring with equilibrium length d
ij
and stiffness
coefficient w
ij
between each pair of vertices, then physical force will drive the system into a
state with minimal energy regardless of the initial state. This motivates us to use iterative methods,
where each iteration is an optimization step. Consider
F (X, Y ) ,
X
i<j
w
ij
d
2
ij
+ trace(X
T
LX) − 2 · trace(X
T
L
Y
Y ) (4)
where
L
ij
,
(
−w
ij
i 6= j
P
k6=i
w
ik
i = j
, L
Y
ij
,
0 i 6= j, Y
i
= Y
j
−d
ij
w
ij
/kY
i
− Y
j
k i 6= j, Y
i
6= Y
j
−
P
k6=i
L
Y
ik
i = j
(5)
剩余19页未读,继续阅读
Dchange
- 粉丝: 3
- 资源: 1
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
最新资源
- C++标准程序库:权威指南
- Java解惑:奇数判断误区与改进方法
- C++编程必读:20种设计模式详解与实战
- LM3S8962微控制器数据手册
- 51单片机C语言实战教程:从入门到精通
- Spring3.0权威指南:JavaEE6实战
- Win32多线程程序设计详解
- Lucene2.9.1开发全攻略:从环境配置到索引创建
- 内存虚拟硬盘技术:提升电脑速度的秘密武器
- Java操作数据库:保存与显示图片到数据库及页面
- ISO14001:2004环境管理体系要求详解
- ShopExV4.8二次开发详解
- 企业形象与产品推广一站式网站建设技术方案揭秘
- Shopex二次开发:触发器与控制器重定向技术详解
- FPGA开发实战指南:创新设计与进阶技巧
- ShopExV4.8二次开发入门:解决升级问题与功能扩展
资源上传下载、课程学习等过程中有任何疑问或建议,欢迎提出宝贵意见哦~我们会及时处理!
点击此处反馈
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功