KnowSim: A Document Similarity Measure on
Structured Heterogeneous Information Networks
Chenguang Wang†, Yangqiu Song‡, Haoran Li†, Ming Zhang†, Jiawei Han‡
†School of EECS, Peking University
‡Department of Computer Science, University of Illinois at Urbana-Champaign
{wangchenguang, lihaoran 2012, mzhang cs}@pku.edu.cn, {yqsong, hanj}@illinois.edu
Abstract—As a fundamental task, document similarity measurement has broad impact on document-based classification, clustering, and ranking. Traditional approaches represent documents as bags-of-words and compute document similarities using measures such as cosine, Jaccard, and Dice. However, entity phrases rather than single words in documents can be critical for evaluating document relatedness. Moreover, the types of entities and the links between entities/words are also informative. We propose a method to represent a document as a typed heterogeneous information network (HIN), where the entities and relations are annotated with types. Multiple documents can be linked by the words and entities in the HIN. Consequently, we convert the document similarity problem into a graph distance problem. Intuitively, there can be multiple paths between a pair of documents. We propose to use the meta-paths defined in the HIN to compute distances between documents. Instead of burdening the user with defining meaningful meta-paths, an automatic method is proposed to rank the meta-paths. Given the meta-paths associated with ranking scores, an HIN-based similarity measure, KnowSim, is proposed to compute document similarities. Using Freebase, a well-known world knowledge base, to conduct semantic parsing and construct HINs for documents, our experiments on the 20Newsgroups and RCV1 datasets show that KnowSim produces high-quality document clustering.
I. INTRODUCTION
Document similarity is a fundamental task that can be used
in many applications such as document classification, clustering,
and ranking. Traditional approaches use bag-of-words
(BOW) as document representation and compute the document
similarities using different measures such as cosine, Jaccard,
and Dice. However, entity phrases rather than just words
in documents can be critical for evaluating the relatedness
between texts. For example, “New York” and “New York
Times” represent different meanings. “George Washington”
and “Washington” are similar if they both refer to a person, but
can be rather different otherwise. If we can detect their names
and types (coarse-grained types such as person, location and
organization; fine-grained types such as politician, musician,
country, and city), they can help us better evaluate whether two
documents are similar. Moreover, the links between entities
or words are also informative. For example, as shown in Fig. 1
of [1], the similarity between the two documents is zero if
we use BOW representation since there is no identical word
shared by them. However, the two documents are related in
content. If we can build a link between “Obama” of type
Politician in one document and “Bush” of type Politician in
another, then the two documents become similar in the sense
that they both talk about politicians and connect to “United
States.” Therefore, we can use the structural information in the
unstructured documents to further improve document similarity
computation.
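As a point of reference, the baseline measures mentioned above can be stated concretely. The following is a minimal illustrative sketch (not code from the paper) of bag-of-words representation with cosine, Jaccard, and Dice similarity:

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words representation: word -> term frequency."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity over term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    """Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def dice(a, b):
    """Dice coefficient over word sets: 2|A ∩ B| / (|A| + |B|)."""
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if sa or sb else 0.0

# Two related documents with no word overlap, as in the motivating
# example: all three BOW measures report zero similarity.
d1 = bow("Obama speaks to reporters")
d2 = bow("Bush addresses journalists")
print(cosine(d1, d2))  # 0.0, despite both documents being about politicians
```

This makes the limitation explicit: any measure computed purely from word overlap assigns zero similarity to related documents that share no vocabulary, which is exactly the gap the structural (HIN-based) approach addresses.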
Some existing studies use linguistic knowledge bases such
as WordNet [2] or general purpose knowledge bases such
as Open Directory Project (ODP) [3], Wikipedia [4], [5],
[6], [7], [8], [9], or knowledge extracted from open domain
data such as Probase [10], [11], to extend the features of
documents to improve similarity measures. However, they treat
knowledge in such knowledge bases as “flat features” and do
not consider the structural information contained in the links
in knowledge bases. Other studies evaluate word or string
similarity based on WordNet or other knowledge sources [12],
taking structural information into account [13], and use word
similarity to compute short-text similarity [14],
[15]. For example, the distance from words to the root is
used to capture the semantic relatedness between two words.
However, WordNet is designed for single words. For named
entities, a separate similarity measure has to be designed [14], [16].
These studies do not consider the relationships between entities
(e.g., “Obama” being related to “United States”). Thus, they
may still lose structural information even if the knowledge
base provides rich linked information. For example, nowadays
there exist numerous general-purpose knowledge bases, e.g.,
Freebase [17], KnowItAll [18], TextRunner [19], WikiTax-
onomy [20], DBpedia [21], YAGO [22], NELL [23] and
Knowledge Vault [24]. They contain abundant world knowledge
about entity types and their relationships, and thus provide rich
opportunities to develop a better measure to evaluate document
similarities.
In this paper, we propose KnowSim, a heterogeneous
information network (HIN) [25] based similarity measure that
explores the structural information from knowledge bases to
compute document similarities. We use Freebase as the source
of world knowledge. Freebase is a collaboratively collected
knowledge base about entities and their organizations [17]. We
follow [1] to use the world knowledge specification framework
including a semantic parser to ground any text to the knowl-
edge bases, and a conceptualization-based semantic filter to re-
solve the ambiguity problem when adapting world knowledge
to the corresponding document. After world knowledge
specification, we obtain the documents together with the extracted
entities and their relations. Since the knowledge bases provide
entity types, the resulting data naturally form an HIN. The
named entities and their types, together with the documents and
the words, form the HIN.
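The construction just described can be sketched as a small typed graph. The following is an illustrative sketch only (the class, node names, and edges are hypothetical, not the paper's actual data structures): documents, words, and knowledge-base entities become nodes carrying type labels, and links connect documents to their words/entities and entities to related entities:

```python
from collections import defaultdict

class HIN:
    """A heterogeneous information network: typed nodes plus undirected links."""

    def __init__(self):
        self.node_type = {}            # node id -> type label
        self.edges = defaultdict(set)  # node id -> set of linked node ids

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_edge(self, u, v):
        # Links: document-word, document-entity, and entity-entity
        # relations taken from the knowledge base.
        self.edges[u].add(v)
        self.edges[v].add(u)

    def neighbors(self, node, ntype):
        """Neighbors of a given type, as used when following a meta-path."""
        return {v for v in self.edges[node] if self.node_type[v] == ntype}

# Hypothetical fragment mirroring the introduction's example: two documents
# with no shared words, linked through typed entities.
hin = HIN()
hin.add_node("doc1", "Document")
hin.add_node("doc2", "Document")
hin.add_node("Obama", "Politician")
hin.add_node("Bush", "Politician")
hin.add_node("United States", "Country")
hin.add_edge("doc1", "Obama")
hin.add_edge("doc2", "Bush")
hin.add_edge("Obama", "United States")
hin.add_edge("Bush", "United States")

# doc1 and doc2 are connected via Document-Politician-Country-Politician-Document,
# so a meta-path relates them even though they share no words.
```

The type labels are what make the network heterogeneous: a path is meaningful not just because the nodes are connected, but because of the sequence of node types it traverses, which is precisely what a meta-path captures.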
Given a constructed HIN, we use meta-path based simi-
larity [26] to measure the similarity between two documents