111:6 D Chandrasekaran and V Mago
both structured taxonomic data and/or as a corpus for training corpus-based methods [77]. The complex category structure of Wikipedia is used as a graph to determine the Information Content of concepts, which in turn aids in calculating the semantic similarity [35].
• BabelNet [66] is a lexical resource that combines WordNet with data available on Wikipedia for each synset. It is the largest multilingual semantic ontology available, with over 13 million synsets and 380 million semantic relations in 271 languages. It includes over four million synsets with at least one associated Wikipedia page for the English language [19].
3.2 Types of Knowledge-based semantic similarity methods
Based on the underlying principle of how the semantic similarity between words is assessed, knowledge-based semantic similarity methods can be further categorized as edge-counting methods, feature-based methods, and information content-based methods.
3.2.1 Edge-counting methods: The most straightforward edge-counting method is to consider the underlying ontology as a graph connecting words taxonomically and count the edges between two terms to measure the similarity between them. The greater the distance between the terms, the less similar they are. This measure, called path, was proposed by Rada et al. [79], where the similarity is inversely proportional to the shortest path length between two terms. This method does not, however, account for the fact that words deeper in the hierarchy have a more specific meaning and may be more similar to each other than two words at the same distance that represent more generic concepts. Wu and Palmer [98]
proposed the wup measure, where the depth of the words in the ontology was considered an important attribute. The wup measure counts the number of edges between each term and their Least Common Subsumer (LCS), the common ancestor shared by both terms in the given ontology. Consider two terms denoted $t_1, t_2$, their LCS denoted $t_{lcs}$, and the shortest path length between them denoted $\mathrm{min\_len}(t_1, t_2)$; path is measured as,
$$\mathrm{sim}_{path}(t_1, t_2) = \frac{1}{1 + \mathrm{min\_len}(t_1, t_2)} \qquad (1)$$
and wup is measured as,
$$\mathrm{sim}_{wup}(t_1, t_2) = \frac{2 \cdot \mathrm{depth}(t_{lcs})}{\mathrm{depth}(t_1) + \mathrm{depth}(t_2)} \qquad (2)$$
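To make the two measures above concrete, here is a minimal sketch that computes both over a small hand-built taxonomy. The node names and the convention that the root has depth 1 are illustrative assumptions, not part of any real ontology such as WordNet.

```python
# Toy taxonomy as child -> parent links; purely illustrative.
parent = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
}

def ancestors(term):
    """Chain of nodes from term up to the root, term included."""
    chain = []
    while term is not None:
        chain.append(term)
        term = parent[term]
    return chain

def depth(term):
    """Number of nodes on the path from the root to term (root has depth 1)."""
    return len(ancestors(term))

def lcs(t1, t2):
    """Least Common Subsumer: the deepest ancestor shared by both terms."""
    shared = set(ancestors(t1))
    for node in ancestors(t2):  # walks upward, so the first hit is the deepest
        if node in shared:
            return node

def min_len(t1, t2):
    """Shortest path length (edge count) between t1 and t2 via their LCS."""
    common = lcs(t1, t2)
    return (depth(t1) - depth(common)) + (depth(t2) - depth(common))

def sim_path(t1, t2):  # Eq. (1)
    return 1 / (1 + min_len(t1, t2))

def sim_wup(t1, t2):   # Eq. (2)
    return 2 * depth(lcs(t1, t2)) / (depth(t1) + depth(t2))

print(sim_path("dog", "cat"))  # min_len = 2 via "animal", so 1/3
print(sim_wup("dog", "cat"))   # 2*2 / (3+3) = 2/3
```

Note how sim_wup rewards pairs whose LCS sits deep in the taxonomy, which is exactly the refinement over sim_path described above.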
Li et al. [49] proposed a measure that takes into account both the minimum path distance and the depth. li is measured as,
$$\mathrm{sim}_{li} = e^{-\alpha\, \mathrm{min\_len}(t_1, t_2)} \cdot \frac{e^{\beta\, \mathrm{depth}(t_{lcs})} - e^{-\beta\, \mathrm{depth}(t_{lcs})}}{e^{\beta\, \mathrm{depth}(t_{lcs})} + e^{-\beta\, \mathrm{depth}(t_{lcs})}} \qquad (3)$$
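The second factor in Eq. (3), $(e^{x} - e^{-x})/(e^{x} + e^{-x})$ with $x = \beta\,\mathrm{depth}(t_{lcs})$, is exactly $\tanh(x)$, so the measure can be sketched compactly. The helper below takes the path length and LCS depth as plain numbers; the default $\alpha$ and $\beta$ values are illustrative assumptions (the parameters are tuned empirically in practice).

```python
import math

def sim_li(path_len, lcs_depth, alpha=0.2, beta=0.6):
    """li measure (Eq. 3): exponential decay in the shortest path length,
    scaled by a saturating function of the LCS depth. The fraction in
    Eq. 3 equals tanh(beta * lcs_depth)."""
    return math.exp(-alpha * path_len) * math.tanh(beta * lcs_depth)

# Longer paths lower similarity; a deeper (more specific) LCS raises it:
print(sim_li(2, 2))
print(sim_li(6, 2))  # smaller than the first value
print(sim_li(2, 6))  # larger than the first value
```

Because tanh saturates, the depth factor matters most near the top of the hierarchy and levels off for very specific concepts.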
However, the edge-counting methods ignore the fact that the edges in an ontology need not be of equal length. To overcome this shortcoming of simple edge-counting methods, feature-based semantic similarity methods were proposed.
3.2.2 Feature-based methods: Feature-based methods calculate similarity as a function of properties of the words, such as gloss, neighboring concepts, etc. [92]. Gloss is defined as the meaning of a word in a dictionary; a collection of glosses is called a glossary. Various semantic similarity methods have been proposed based on the gloss of words. Gloss-based semantic similarity measures exploit the knowledge that words with similar meanings have more words in common in their glosses.
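This overlap idea can be sketched in a few lines; the one-line glosses and the tiny stopword list below are invented for illustration and not taken from any real dictionary.

```python
def gloss_overlap(gloss1, gloss2, stopwords=("a", "an", "the", "of", "or", "that")):
    """Count content words shared by two glosses."""
    w1 = set(gloss1.lower().split()) - set(stopwords)
    w2 = set(gloss2.lower().split()) - set(stopwords)
    return len(w1 & w2)

# Hypothetical glosses for two senses of "bank" and for "shore":
bank_river = "sloping land beside a body of water"
bank_money = "financial institution that accepts deposits"
shore = "land along the edge of a body of water"

print(gloss_overlap(bank_river, shore))  # 3: "land", "body", "water"
print(gloss_overlap(bank_money, shore))  # 0: no shared content words
```

The related sense shares gloss vocabulary with "shore" while the unrelated sense shares none, which is the signal these measures exploit.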
The semantic similarity is measured as the extent of overlap between the glosses of the words in consideration. The Lesk measure [10] assigns a value of relatedness between two words based on the overlap of words in their gloss and the glosses of the concepts they are related to in an
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2020.