Meta-Path-Based Link Prediction in Schema-Rich Heterogeneous Information Networks


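"This research paper examines meta-path-based link prediction in schema-rich heterogeneous information networks. Heterogeneous information networks (HINs) have developed rapidly in recent years; they contain different types of nodes and relations and have complex structure and rich semantics. A meta-path is a sequence of object types and relations connecting these nodes, and it is widely used to mine semantic information in HINs. Link prediction is an important data mining task that predicts potential links between nodes and is essential for applications such as filling in missing links. Traditional methods usually work on simple HINs, such as bipartite graphs or star schemas, and require meta-paths to be predefined or enumerated. In much real-world network data, however, it is difficult to describe the network structure with a simple schema."

In this paper, the authors Xiaohuan Cao, Yuyan Zheng, Chuan Shi, Jingzhi Li and Bin Wu propose a new method for link prediction in schema-rich heterogeneous information networks. They stress that traditional meta-path-dependent methods may not suit datasets whose network structure is complex and whose meta-paths are hard to predefine.

The paper first explains the basic concepts of heterogeneous information networks. An HIN is a network made up of different types of entities (nodes) and multiple kinds of relations between them. The entities can be people, organizations, events and so on, and the relations can be "belongs to", "works at" and the like. Because of this complexity, HINs need more in-depth analysis methods to reveal hidden semantic associations. The meta-path is the key tool for understanding HIN semantics: it is a sequence of connections made up of node types and relations. For example, "User-publishes-Paper-User" is a meta-path indicating that two users are related through jointly publishing a paper. The choice of meta-paths directly affects the accuracy and efficiency of link prediction.

The paper then discusses existing link prediction methods, especially meta-path-based ones. These methods usually assume that specific meta-paths are known and use similarity along those paths to predict new links. This overlooks the large number of unknown or underused meta-paths that may exist in the network. To address the problem, the paper proposes a novel framework that automatically learns and exploits effective meta-paths in a schema-rich HIN without predefining them. The approach is expected to improve the accuracy and generalization of link prediction while adapting to changes in complex network structure.

The paper may also cover criteria for evaluating model performance, such as precision, recall and F1 score, as well as the experimental setup, including dataset selection, baseline methods and result analysis. The experiments suggest that the proposed model may outperform traditional meta-path-dependent methods in prediction performance. Overall, the paper offers a new perspective on link prediction in schema-rich heterogeneous information networks: by learning meta-paths automatically, it improves the accuracy and practicality of prediction, which matters for understanding and processing complex network data.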

Int J Data Sci Anal (2017) 3:285–296 287

Heterogeneous Information Network

A heterogeneous information network (HIN), with different types of nodes and relations, carries richer semantic information than a general network with a single type. HIN therefore provides a new way to manage networked data, and many data mining tasks on HINs have been studied recently, including similarity measure [12,24], clustering [25,26], classification [9,11], link prediction [2,22], ranking [14,34], recommendation [8,18], and information fusion [10,19]. However, these tasks work only on simple HINs with simple schemas. Few data mining tasks have been studied for schema-rich HINs, and the existing methods for simple HINs are not appropriate for them. The study of schema-rich HINs deserves more attention.

Link Prediction in HIN

With the prevalence of HINs, link prediction in HINs has attracted many researchers. Several works use meta-path features for link prediction in simple HINs [2,22,23,30]. Sun et al. [22] proposed PathPredict, which solves the co-author relationship prediction problem by extracting meta-path-based features and building a logistic regression-based model. Cao et al. [2] designed an iterative framework to predict multiple types of links collectively in an HIN. Zhang et al. [33] utilized meta-paths to predict organization charts or management hierarchies. Sun et al. [23] modeled the distribution of relationship building time to predict when a certain relationship will be formed.
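The general idea of feeding meta-path-based features into a logistic regression model can be sketched as follows. This is a minimal illustration, not the PathPredict implementation of [22]: the two features (hypothetical path counts for an author pair), the toy training data, and the helper names `train_logistic` and `predict_proba` are all assumptions made for this example.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean logistic loss; returns (weights, bias)."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict_proba(w, b, x):
    """Estimated probability that a link forms for feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical data: each row holds path counts for one author pair under two
# meta-paths (e.g., shared venues, common co-authors); the label says whether
# the pair later co-authored a paper. The values are made up for illustration.
X = [[3, 2], [2, 3], [4, 1], [0, 0], [1, 0], [0, 1]]
y = [1, 1, 1, 0, 0, 0]

w, b = train_logistic(X, y)
print(predict_proba(w, b, [3, 3]), predict_proba(w, b, [0, 0]))
```

In this sketch each meta-path contributes one feature, so the learned weights indicate how strongly each meta-path is associated with link formation on the toy data.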

Some researchers also use probabilistic models for link prediction in HINs. For example, Yang et al. [29] developed a probabilistic method, MRIP, to predict links in multi-relational heterogeneous networks. Dong et al. [5] proposed a transfer-based ranking factor graph model that combines several social patterns with network structure information for link prediction and recommendation. Huang et al. [6] designed the joint manifold factorization (JMF) method for trust prediction with the ancillary rating matrix via aggregating HINs.

The methods mentioned above mostly focus on link prediction in a single HIN. Recently, some works have studied link prediction across multiple aligned HINs [13,32]. Zhang et al. [32] proposed the SCAN-PS method to solve the social link prediction problem for new users using the “anchors.” Liu et al. [13] designed the aligned factor graph model for the user–user link prediction problem by utilizing information from another similar social network.

However, these link prediction methods are all developed for simple HINs. As an HIN becomes larger and more complicated, different link prediction methods need to be designed for it.

3 Preliminary and problem definition

In this section, we introduce some basic concepts used in this paper and give the problem definition.

A heterogeneous information network (HIN) [7] is a kind of information network defined as a directed network graph G = (V, E), which consists of either different types of nodes V or different types of edges E. Specifically, an information network can be abstracted to a network schema M = (R, L), where R is the set of node types and L is the set of edge types, and there is a node-type mapping function θ : V → R and an edge-type mapping function ϕ : E → L. When the number of node types |R| > 1 or the number of edge types |L| > 1, the network is a heterogeneous information network. For example, in a bibliographic database such as DBLP [4], papers are connected via authors, venues and terms, and they can be organized as a star-schema HIN. Another example is the users and items of an e-commerce website, which constitute a bipartite HIN [8].
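The definition above can be made concrete with a small data structure. The following is a minimal sketch, assuming nodes are plain strings and edges are stored as (source, edge type, target) triples; the class name `HIN`, its methods, and the node and edge type names in the toy examples are illustrative choices, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class HIN:
    """Directed typed graph G = (V, E) with mappings theta: V -> R and phi: E -> L."""
    node_type: dict = field(default_factory=dict)  # theta: node -> node type
    edges: list = field(default_factory=list)      # E as (source, edge type, target)

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_edge(self, source, etype, target):
        # Both endpoints must already be typed nodes.
        for endpoint in (source, target):
            if endpoint not in self.node_type:
                raise KeyError(f"unknown node: {endpoint!r}")
        self.edges.append((source, etype, target))

    def schema(self):
        """Return M = (R, L): the set of node types and the set of edge types."""
        node_types = set(self.node_type.values())
        edge_types = {etype for _, etype, _ in self.edges}
        return node_types, edge_types

    def is_heterogeneous(self):
        # Heterogeneous when |R| > 1 or |L| > 1.
        node_types, edge_types = self.schema()
        return len(node_types) > 1 or len(edge_types) > 1

# Toy star-schema bibliographic network centred on one paper (names are hypothetical).
dblp = HIN()
for node, ntype in [("p1", "Paper"), ("a1", "Author"), ("v1", "Venue"), ("t1", "Term")]:
    dblp.add_node(node, ntype)
dblp.add_edge("p1", "writtenBy", "a1")
dblp.add_edge("p1", "publishedIn", "v1")
dblp.add_edge("p1", "contains", "t1")

# A network with a single node type and a single edge type, for contrast.
friends = HIN()
friends.add_node("u1", "User")
friends.add_node("u2", "User")
friends.add_edge("u1", "follows", "u2")

print(dblp.schema())
print(dblp.is_heterogeneous(), friends.is_heterogeneous())
```

Storing edges as triples, rather than keying them by node pair, lets two nodes be connected by several edge types at once, which is common in knowledge graphs.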

In an HIN, there can be different paths connecting two entity nodes, and these paths are called meta-paths [25]. A meta-path $\Pi$ is defined as $\Pi^{R_1,\cdots,R_{l+1}} = R_1 \xrightarrow{L_1} R_2 \xrightarrow{L_2} \cdots \xrightarrow{L_l} R_{l+1}$, which describes a path between two node types $R_1$ and $R_{l+1}$, going through a series of node types $R_1, \cdots, R_{l+1}$ and a series of link types $L_1, \cdots, L_l$. Taking the knowledge graph in Fig. 1 as an example, we can consider this knowledge graph as an HIN, which includes many different node types (e.g., person, city, country) and link types (e.g., bornIn and locatedIn). Every two node types can be connected by multiple meta-paths. For example, there are two meta-paths connecting Person and Country in Fig. 1: $\text{Person} \xrightarrow{bornIn} \text{City} \xrightarrow{locatedIn} \text{Country}$ and $\text{Person} \xrightarrow{diedIn} \text{City} \xrightarrow{hasCapital^{-1}} \text{Country}$. Different meta-paths carry different semantic meanings, so nodes connected by different meta-paths have different similarity. Thus, we can calculate the similarity of entity node pairs based on different meta-paths, each of which represents a different feature.
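As a rough illustration of how different meta-paths yield different features for the same node pair, the sketch below counts path instances along a given sequence of link types. The triples are a made-up toy knowledge graph rather than the contents of Fig. 1, the helper names `build_index`, `count_instances` and `features` are hypothetical, and raw path counts stand in here for whichever similarity measure is actually used. The sketch follows link types only and assumes they determine the node types, as they do in this toy graph.

```python
from collections import defaultdict

# Toy knowledge graph as (head, link type, tail) triples; entities are illustrative.
TRIPLES = [
    ("alice", "bornIn", "paris"),
    ("alice", "diedIn", "lyon"),
    ("bob", "bornIn", "berlin"),
    ("bob", "diedIn", "paris"),
    ("paris", "locatedIn", "france"),
    ("lyon", "locatedIn", "france"),
    ("berlin", "locatedIn", "germany"),
    ("france", "hasCapital", "paris"),
    ("germany", "hasCapital", "berlin"),
]

def build_index(triples):
    """Index neighbours by (node, link type); 'x^-1' denotes the inverse of link x."""
    index = defaultdict(list)
    for head, link, tail in triples:
        index[(head, link)].append(tail)
        index[(tail, link + "^-1")].append(head)
    return index

def count_instances(index, source, meta_path):
    """Follow the link types of meta_path in order, starting from source.

    Returns {end node: number of path instances reaching it}.
    """
    frontier = {source: 1}
    for link in meta_path:
        reached = defaultdict(int)
        for node, n_paths in frontier.items():
            for neighbour in index.get((node, link), []):
                reached[neighbour] += n_paths
        frontier = dict(reached)
    return frontier

# The two Person-to-Country meta-paths from the text, written as link-type sequences.
BORN_IN_COUNTRY = ["bornIn", "locatedIn"]
DIED_IN_CAPITAL_OF = ["diedIn", "hasCapital^-1"]

def features(index, person, country):
    """One path-count feature per meta-path for a (person, country) pair."""
    return [
        count_instances(index, person, path).get(country, 0)
        for path in (BORN_IN_COUNTRY, DIED_IN_CAPITAL_OF)
    ]

index = build_index(TRIPLES)
print(features(index, "alice", "france"))  # [1, 0]
print(features(index, "bob", "france"))    # [0, 1]
```

Both pairs end at the same country, but each is connected through a different meta-path, so their feature vectors differ.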

To use meta-path features properly, meta-path-based similarity measures have been proposed to quantify the similarity of nodes in an HIN [12,16,24]. Most studies and applications of HINs build on these similarity measures. Sun et al. [24] proposed PathSim to calculate the similarity of same-typed entity nodes based on symmetric paths. Lao et al. [12] designed a path-constrained random walk (PCRW) algorithm to measure entity relatedness in a labeled directed graph. Shi et al. [16] proposed HeteSim to measure the relevance of any entity pair under an arbitrary meta-path. Although all of these measures can do similarity calculation, not every measure could be used for fast calculation in the process of finding
