Meta Structure: Computing Relevance in
Large Heterogeneous Information Networks
Zhipeng Huang, Yudian Zheng, Reynold Cheng, Yizhou Sun
†
, Nikos Mamoulis, Xiang Li
The University of Hong Kong,
†
Northeastern University
{zphuang, ydzheng2, ckcheng, nikos, xli2}@cs.hku.hk,
†
yzsun@ccs.neu.edu
ABSTRACT
A heterogeneous information network (HIN) is a graph model in
which objects and edges are annotated with types. Large and com-
plex databases, such as YAGO and DBLP, can be modeled as HINs.
A fundamental problem in HINs is the computation of closeness,
or relevance, between two HIN objects. Relevance measures can
be used in various applications, including entity resolution, rec-
ommendation, and information retrieval. Several studies have in-
vestigated the use of HIN information for relevance computation,
however, most of them only utilize simple structure, such as path,
to measure the similarity between objects. In this paper, we pro-
pose to use meta structure, which is a directed acyclic graph of
object types with edge types connecting in between, to measure the
proximity between objects. The strength of meta structure is that it
can describe complex relationship between two HIN objects (e.g.,
two papers in DBLP share the same authors and topics). We de-
velop three relevance measures based on meta structure. Due to the
computational complexity of these measures, we further design an
algorithm with data structures proposed to support their evaluation.
Our extensive experiments on YAGO and DBLP show that meta
structure-based relevance is more effective than state-of-the-art ap-
proaches, and can be efficiently computed.
1. INTRODUCTION
Heterogeneous information networks (HINs), such as DBLP [8],
YAGO [15], DBpedia [1] and Freebase [2], have recently received
a lot of attention. These graph data sources contain a vast number
of inter-related facts, and they are used to facilitate the discovery
of interesting knowledge [5, 7, 12, 13]. Figure 1 illustrates an HIN,
which describes the relationship among entities of different types
(e.g., author, paper, venue and topic). For example, Jiawei Han
(a
2
) has written a VLDB paper (p
2,2
), which mentions the topic
“efficient” (t
3
).
Given two HIN objects a and b, the evaluation of their relevance
is of fundamental importance. This quantifies the degree of close-
ness between a and b. In Figure 1, Jian Pei (a
1
) and Jiawei Han
(a
2
) have a high relevance score, since they have both published pa-
pers with keyword “mining” in the same venue (KDD). Relevance
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full cita-
tion on the first page. Copyrights for components of this work owned by others than
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-
publish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
KDD ’16, August 13-17, 2016, San Francisco, CA, USA
c
2016 ACM. ISBN 978-1-4503-4232-2/16/08. . . $15.00
DOI: http://dx.doi.org/10.1145/2939672.2939815
KDD
“mining”
AAAI
VLDB
“efficient”
“privacy”
AAAI’15
VLDB’15
KDD’15
KDD’07
ICDM
“social”
ICDM’12
write publishmention
VLDB’06
author
paper
venue topic
object types:
edge types:
Figure 1: Illustrating an HIN.
finds its applications in information retrieval, recommendation, and
clustering [18, 22]: a researcher can retrieve papers that have high
relevance in terms of topics and venues in DBLP; in YAGO, rele-
vance facilitates the extraction of actors who are close to a given
director. As another example, in entity resolution applications, du-
plicated HIN object pairs having high relevance scores (e.g., two
different objects in an HIN referring to the same real-world person)
can be identified and removed from the HIN.
Prior works. To measure the relevance between two graph ob-
jects, neighborhood-based measures such as common neighbors
and Jaccard’s coefficient were proposed [9]. Other graph-theoretic
measures that are based on random walks between objects include
Personalized PageRank [3] and SimRank [6]. These measures do
not consider object and edge type information in an HIN. To handle
this information, the concept of meta paths has been recently pro-
posed [7, 18]. A meta path is a sequence of object types with edge
types in between. Figure 2(b) illustrates a meta path P
1
, which
states that two authors (A
1
and A
2
) are related by their publica-
tions in the same venue (V ). Another meta path P
2
says that two
authors have written papers containing the same topic (T ). Based
on a meta path, several relevance measures, such as PathCount,
PathSim, and Path Constrained Random Walk (PCRW) [7,18] have
been proposed. These measures have been shown to be better than
those that do not consider object and edge type information.
Meta structures. We propose a novel concept, named meta
structure, to depict the relationship of two graph objects. This is
essentially a directed acyclic graph of object and edge types. Fig-
ure 2(b) illustrates a meta structure S, which depicts that two au-
thors are relevant if they have published papers in the same venue,
and have also mentioned the same topic. A meta path (e.g., P
1
or