KnowSim: A Document Similarity Measure on
Structured Heterogeneous Information Networks
Chenguang Wang†, Yangqiu Song‡, Haoran Li†, Ming Zhang†, Jiawei Han‡
†School of EECS, Peking University
‡Department of Computer Science, University of Illinois at Urbana-Champaign
{wangchenguang, lihaoran 2012, mzhang cs}@pku.edu.cn, {yqsong, hanj}@illinois.edu
Abstract—As a fundamental task, document similarity measurement has broad impact on document-based classification, clustering, and ranking. Traditional approaches represent documents as bags-of-words and compute document similarities using measures such as cosine, Jaccard, and Dice. However, entity phrases rather than single words in documents can be critical for evaluating document relatedness. Moreover, the types of entities and the links between entities/words are also informative. We propose a method to represent a document as a typed heterogeneous information network (HIN), where the entities and relations are annotated with types. Multiple documents can be linked by the words and entities in the HIN. Consequently, we convert the document similarity problem into a graph distance problem. Intuitively, there can be multiple paths between a pair of documents. We propose to use the meta-paths defined in the HIN to compute distances between documents. Instead of burdening the user with defining meaningful meta-paths, an automatic method is proposed to rank the meta-paths. Given the meta-paths associated with ranking scores, an HIN-based similarity measure, KnowSim, is proposed to compute document similarities. Using Freebase, a well-known world knowledge base, to conduct semantic parsing and construct HINs for documents, our experiments on the 20Newsgroups and RCV1 datasets show that KnowSim produces high-quality document clustering.
I. INTRODUCTION
Document similarity is a fundamental task that can be used
in many applications such as document classification, clustering,
and ranking. Traditional approaches use bag-of-words
(BOW) as document representation and compute the document
similarities using different measures such as cosine, Jaccard,
and Dice. However, entity phrases rather than just words
in documents can be critical for evaluating the relatedness
between texts. For example, “New York” and “New York
Times” represent different meanings. “George Washington”
and “Washington” are similar if they both refer to a person, but
can be rather different otherwise. If we can detect their names
and types (coarse-grained types such as person, location and
organization; fine-grained types such as politician, musician,
country, and city), they can help us better evaluate whether two
documents are similar. Moreover, the links between entities
or words are also informative. For example, as shown in Fig. 1
of [1], the similarity between the two documents is zero if
we use BOW representation since there is no identical word
shared by them. However, the two documents are related in
content. If we can build a link between “Obama” of type
Politician in one document and “Bush” of type Politician in
another, then the two documents become similar in the sense
that they both talk about politicians and connect to “United
States.” Therefore, we can use the structural information in the
unstructured documents to further improve document similarity
computation.
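As a point of reference, the baseline measures mentioned above can be stated concretely. The following is a minimal illustrative sketch (not code from the paper) of bag-of-words representation with cosine, Jaccard, and Dice similarity:

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words representation: word -> term frequency."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity over term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    """Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def dice(a, b):
    """Dice coefficient over word sets: 2|A ∩ B| / (|A| + |B|)."""
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if sa or sb else 0.0

# Two related documents with no word overlap, as in the motivating
# example: all three BOW measures report zero similarity.
d1 = bow("Obama speaks to reporters")
d2 = bow("Bush addresses journalists")
print(cosine(d1, d2))  # 0.0, despite both documents being about politicians
```

This makes the limitation explicit: any measure computed purely from word overlap assigns zero similarity to related documents that share no vocabulary, which is exactly the gap the structural (HIN-based) approach addresses.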
Some existing studies use linguistic knowledge bases such
as WordNet [2] or general purpose knowledge bases such
as Open Directory Project (ODP) [3], Wikipedia [4], [5],
[6], [7], [8], [9], or knowledge extracted from open domain
data such as Probase [10], [11], to extend the features of
documents to improve similarity measures. However, they treat
knowledge in such knowledge bases as “flat features” and do
not consider the structural information contained in the links
in knowledge bases. Other studies evaluate word or string
similarity based on WordNet or other knowledge sources [12],
taking structural information into account [13], and use word
similarity to compute short-text similarity [14],
[15]. For example, the distance from words to the root is
used to capture the semantic relatedness between two words.
However, WordNet is designed for single words. For named
entities, a separate similarity measure has to be designed [14], [16].
These studies do not consider the relationships between entities
(e.g., “Obama” being related to “United States”). Thus, they
may still lose structural information even if the knowledge
base provides rich linked information. For example, nowadays
there exist numerous general-purpose knowledge bases, e.g.,
Freebase [17], KnowItAll [18], TextRunner [19], WikiTax-
onomy [20], DBpedia [21], YAGO [22], NELL [23] and
Knowledge Vault [24]. They contain abundant world knowledge
about entity types and their relationships, and thus provide rich
opportunities to develop a better measure to evaluate document
similarities.
In this paper, we propose KnowSim, a heterogeneous
information network (HIN) [25] based similarity measure that
explores the structural information from knowledge bases to
compute document similarities. We use Freebase as the source
of world knowledge. Freebase is a collaboratively collected
knowledge base about entities and their organizations [17]. We
follow [1] to use the world knowledge specification framework
including a semantic parser to ground any text to the knowl-
edge bases, and a conceptualization-based semantic filter to re-
solve the ambiguity problem when adapting world knowledge
to the corresponding document. After world knowledge
specification, we obtain the documents together with the extracted
entities and their relations. Since the knowledge bases provide
entity types, the resulting data naturally form an HIN. The
named entities and their types, together with the documents and
the words, form the HIN.
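The construction just described can be sketched as a small typed graph. The following is an illustrative sketch only (the class, node names, and edges are hypothetical, not the paper's actual data structures): documents, words, and knowledge-base entities become nodes carrying type labels, and links connect documents to their words/entities and entities to related entities:

```python
from collections import defaultdict

class HIN:
    """A heterogeneous information network: typed nodes plus undirected links."""

    def __init__(self):
        self.node_type = {}            # node id -> type label
        self.edges = defaultdict(set)  # node id -> set of linked node ids

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_edge(self, u, v):
        # Links: document-word, document-entity, and entity-entity
        # relations taken from the knowledge base.
        self.edges[u].add(v)
        self.edges[v].add(u)

    def neighbors(self, node, ntype):
        """Neighbors of a given type, as used when following a meta-path."""
        return {v for v in self.edges[node] if self.node_type[v] == ntype}

# Hypothetical fragment mirroring the introduction's example: two documents
# with no shared words, linked through typed entities.
hin = HIN()
hin.add_node("doc1", "Document")
hin.add_node("doc2", "Document")
hin.add_node("Obama", "Politician")
hin.add_node("Bush", "Politician")
hin.add_node("United States", "Country")
hin.add_edge("doc1", "Obama")
hin.add_edge("doc2", "Bush")
hin.add_edge("Obama", "United States")
hin.add_edge("Bush", "United States")

# doc1 and doc2 are connected via Document-Politician-Country-Politician-Document,
# so a meta-path relates them even though they share no words.
```

The type labels are what make the network heterogeneous: a path is meaningful not just because the nodes are connected, but because of the sequence of node types it traverses, which is precisely what a meta-path captures.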
Given a constructed HIN, we use meta-path based simi-
larity [26] to measure the similarity between two documents