2017 12th International Conference on Intelligent Systems and Knowledge Engineering (ISKE)
978-1-5386-1829-5/17/$31.00 ©2017 IEEE
A Relation Prediction Method Based on PU Learning
Gao-Jing Peng, Ke-Jia Chen, Shijun Xue, Bin Liu
Jiangsu Key Laboratory of Big Data Security & Intelligent Processing
Nanjing University of Posts and Telecommunications
Nanjing, Jiangsu 210046, China
penggj_njupt@163.com, chenkj@njupt.edu.cn, xueshijun_2010@163.com, bins@ieee.org
Abstract—This paper studies
relation prediction in heterogeneous information networks in the PU learning
setting. One challenge of this problem is the size imbalance
between the positive set P (the set of node pairs
with the target relation) and the unlabeled set U (the set of
node pairs without the target relation). We propose SemiPUclus,
a technique based on K-means and a voting mechanism, to extract
the reliable negative set RN from U under a new relation
prediction framework, PURP. The experimental results show
that PURP outperforms the comparison
methods on DBLP co-authorship network data.
Keywords-link prediction; relation prediction; heterogeneous
information networks; PU learning
I. INTRODUCTION
Link prediction aims to estimate how likely missing links are to form
in a network, based on the
network’s current or historical data. It has a wide range of
applications, such as citation prediction in a bibliographic
dataset, product recommendation in an e-commerce service, and
online advertisement click prediction in an online network
[1]. Most existing link prediction methods
are designed for homogeneous information networks, which
contain a single type of node and a single type of edge.
However, real networks usually contain multiple types of nodes
and edges. Such networks are called heterogeneous
information networks (HINs). In a HIN, structural
dependencies among different relations further increase the
difficulty of link prediction [2] [3] [4]. Recently, Sun et al. [5]
applied the concept of meta-paths in HINs to the relation
prediction problem, which can be seen as an extension of the
link prediction problem. Here is an example of relation
prediction in a co-authorship network (Figure 1). The
network includes four types of nodes and ten types of links.
The relation to predict is the co-authorship between
any author pair, which can be represented by the meta-path
$\text{Author} \xrightarrow{\text{write}} \text{Paper} \xrightarrow{\text{write}^{-1}} \text{Author}$.
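The meta-path above can be illustrated with a minimal Python sketch. It counts, for each author pair, how many Author-write-Paper-write^-1-Author path instances connect them, which is simply the number of papers they share. The toy `WRITES` list and the helper name `coauthor_path_counts` are assumptions made for illustration and do not come from the paper.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: (author, paper) pairs, one per "write" link.
WRITES = [
    ("a1", "p1"), ("a2", "p1"),
    ("a2", "p2"), ("a3", "p2"),
    ("a1", "p3"), ("a2", "p3"),
]

def coauthor_path_counts(writes):
    """Count Author -write-> Paper -write^-1-> Author path instances per author pair."""
    # Follow "write" forward: group authors by the paper they wrote.
    authors_of = defaultdict(set)
    for author, paper in writes:
        authors_of[paper].add(author)

    # Follow "write" backward: every two authors of one paper form one path instance.
    counts = defaultdict(int)
    for authors in authors_of.values():
        for pair in combinations(sorted(authors), 2):
            counts[pair] += 1
    return dict(counts)

counts = coauthor_path_counts(WRITES)
```

On this toy data, `counts` maps `("a1", "a2")` to 2 and `("a2", "a3")` to 1. Any pair with a nonzero count already has the target co-authorship relation.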
Relation prediction can be treated as a supervised
learning process. If the target relation exists between $a_1$ and $a_2$,
the label of the node pair $\langle a_1, a_2 \rangle$ is set to “+1”; otherwise it
is set to “-1”. This process normally requires many
positive and negative examples to train the model.
However, negative examples are often scarce
or unavailable in many real-world fields. The PU learning
technique makes it possible to construct a classification model
from positive and unlabeled examples.
Figure 1. The co-authorship network
In the above co-authorship network, if the target relation
between $a_1$ and $a_2$ does not exist at the moment, it does not
mean that the target relation will not form in the future. So
the label of a node pair without the target relation is better
set to “0” instead of “-1”. Under this assumption, all node
pairs are divided into the positive example set P and
the unlabeled example set U.
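As a rough illustration of this split, the following Python sketch labels every author pair “1” if the co-authorship relation is observed and “0” otherwise, then collects the pairs into P and U. The author names, the observed pairs, and the helper name `split_positive_unlabeled` are hypothetical and serve only as an example.

```python
from itertools import combinations

def split_positive_unlabeled(authors, coauthor_pairs):
    """Split all author pairs into a positive set P and an unlabeled set U."""
    # Normalize pair order so that (a2, a1) and (a1, a2) are treated as the same pair.
    linked = {tuple(sorted(pair)) for pair in coauthor_pairs}

    positive, unlabeled = [], []
    for pair in combinations(sorted(authors), 2):
        if pair in linked:
            positive.append(pair)   # label 1: the target relation is observed
        else:
            unlabeled.append(pair)  # label 0: unknown, not assumed negative
    return positive, unlabeled

authors = ["a1", "a2", "a3", "a4"]
P, U = split_positive_unlabeled(authors, [("a2", "a1"), ("a2", "a3")])
```

Even with four authors, U (four pairs) is already twice the size of P (two pairs), which gives a small-scale view of the imbalance between the two sets.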
PU learning has become a new research topic in the field
of classification. Though widely used in text mining, graph
mining and other areas [7] [8] [9], it was not used in link mining
until recent years. In 2014, Zhang et al. [6] used PU learning
for the first time to predict anchor links between multiple
networks. They used the Spy technique to extract reliable
negative examples. Unlike their work, this paper
aims to predict a target relation, and is not limited to links,
in a single HIN.
The main challenges of relation prediction in a HIN are:
• Extraction of reliable negative examples
The most important challenge of PU learning is to
extract the reliable negative examples RN from U. But
for relation prediction in a HIN, most of the existing
semi-supervised PU learning methods are no longer
suitable. It is necessary to design a new
efficient method.
• Heterogeneity of network
A HIN contains multiple types of nodes and links, so
traditional link prediction methods for homogeneous
information networks are no longer applicable. Also,
the relations between nodes and the
heterogeneity of links make the prediction task
considerably harder.
• Link sparsity