Int J Data Sci Anal (2017) 3:285–296 287
Heterogeneous Information Network
Heterogeneous information network (HIN), with different
types of nodes and relations, has richer semantic informa-
tion than general network with single type. Therefore, HIN
provides a new orientation to manage networked data. And
kinds of data mining tasks for HIN are realized recently.
These research developments include similarity measure
[12,24], clustering [25,26], classification [9,11], link pre-
diction [2,22], ranking [14,34], recommendation [8,18],
information fusion [10,19]. But these tasks just work on sim-
ple HINs with simple schema. The data mining tasks for
schema-rich HIN are few to be done, and the existing meth-
ods for simple HIN are not appropriate to the situation of
schema-rich HIN. We could take more attention to the study
of schema-rich HIN.
Link Prediction in HIN
With the prevalence of HIN, link prediction in HIN has
attracted many researchers. Using the meta-path feature,
some works have been done [2,22,23,30]insimpleHIN.
Sun et al. [22] proposed PathPredict to solve the problem of
co-author relationship prediction by extracting meta-path-
based feature and building logistic regression-based model.
Cao et al. [2] designed an iterative framework to predict mul-
tiple types of links collectively in HIN. Zhang et al. [33]
utilized meta-paths to predict organization chart or manage-
ment hierarchy. And Sun et al. [23] modeled the distribution
of relationship building time to predict when a certain rela-
tionship will be formed.
Some researchers also utilize probabilistic models to do
link prediction tasks in HIN. For example, Yang et al. [29]
developed a probabilistic method MRIP to predict links in
multi-relational heterogeneous networks. Dong et al. [5]
proposed a transfer-based ranking factor graph model that
combines several social patterns with network structure
information for link prediction and recommendation. Huang
et al. [6] designed the joint manifold factorization (JMF)
method to do trust prediction with the ancillary rating matrix
via aggregating HINs.
The methods mentioned above mostly focus on link pre-
diction in one single HIN. Recently, some works study the
problems of link prediction across multiple aligned HINs
[13,32]. Zhang et al. [32] proposed SCAN-PS method to
solve the social link prediction problem for new users using
the “anchors.” Liu et al. [13] designed the aligned factor
graph model for user–user link prediction problem by utiliz-
ing information from another similar social network.
However, the research developments of link prediction
are all developed for simple HIN. When the HIN becomes
bigger and more complicated, we should design different link
prediction methods for it.
3 Preliminary and problem definition
In this section, we introduce some basic concepts used in this
paper and give the problem definition.
Heterogeneous information network (HIN) [7]isakind
of information network defined as a directed network graph
G = (V, E), which consists of either different types of nodes
V or different types of edges E. Specifically, an information
network can be abstracted to a network schema M = (R, L)
where R is the set of the node types and L is the set of
the edge types, and there is a node-type mapping function
θ : V → R and an edge-type mapping function ϕ : E → L.
When the number of node types |R| > 1 or the number of
edge types |L| > 1, the network is a heterogeneous infor-
mation network. For example, in bibliographic database, like
DBLP [4], papers are connected together via authors, venues
and terms, and they can be organized as a star schema HIN.
Another example is the users and items in e-commerce web-
site which constitutes a bipartite HIN [8].
In an HIN, there can be different paths connecting two
entity nodes, and these paths are called as meta-path [25].
A meta-path
that is defined as
R
1
,··· , R
l+1
= R
1
L
1
−→
R
2
L
2
−→ ···
L
l
−→ R
l+1
, which describes a path between two
node types R
1
and R
l+1
, going through a series of node types
R
1
, ··· , R
l+1
and a series of link types L
1
, ··· , L
l
. Tak-
ing the knowledge graph in Fig. 1 as an example, we can
consider this Knowledge Graph as an HIN, which includes
many different node types (e.g., person, city, country) and
link types (e.g., bornIn and locatedIn). Every two node
types can be connected by multiple meta-paths. For exam-
ple, there are two meta-paths connecting Person and Country
in Fig. 1: Person
bor n in
−−−−→ City
located In
−−−−−→ Country and
Person
Di ed i n
−−−−→ City
hasC api tal
−1
−−−−−−−−→ Country. Obviously,
different meta-paths show different semantic meanings, so
that nodes connected by different meta-paths have different
similarity. Thus, we can calculate the similarity of entity node
pairs based on different meta-paths, which represent different
features.
To use meta-path feature properly, meta-path-based sim-
ilarity measures are proposed to make meta-path feature
quantization and quantify the similarity of nodes [12,16,24]
in HIN. Most of the studies or applications of HIN are based
on these similarity measures to be performed. Sun et al. [24]
proposed PathSim to calculate the similarity of the same-
typed entity nodes based on symmetric paths. Lao et al. [12]
designed a path-constrained random walk (PCRW) algorithm
to measure the entity relativity in a labeled directed graph.
Shi et al. [16] proposed HeteSim to measure relevance of any
entity pair under arbitrary meta-path. Although all of these
measures can do similarity calculation, not every measure
could be used for fast calculation in the process of finding
123