would nonlinearly decrease as the shortest path connecting them increases. Therefore, it is reasonable to expect that the similarity decreases at an exponential rate as the shortest path increases, and f_1 is defined by:

f_1(l) = e^{-\alpha l}   (2)

where \alpha is a real constant between 0 and 1. From (2) we can see that as the path length decreases to zero, the similarity monotonically increases toward 1, while as the path length increases to infinity, the similarity monotonically decreases to 0. However, the shortest path alone may not be accurate enough for semantic similarity calculation, so the shortest path length method must be revised by adding more information from the hierarchical semantic structure of WordNet. It is intuitive that concepts at higher levels of the hierarchy carry more general information, while concepts at lower levels have more concrete semantics. Thus, the depth of a concept in the hierarchy should be taken into account. The depth h of the subsumer is derived by calculating the shortest path length from the subsumer to the root concept of the ontology. According to this observation, the depth function for similarity is defined by:
f_2(h) = \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}   (3)

where \beta > 0 is a smoothing factor. f_2 can also be regarded as an extension of Shepard's law [39], which claims that exponential-decay functions are a universal law of stimulus generalization in psychological science.
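For illustration, a minimal Python sketch of (2) and (3) follows; the function names are ours, the parameter values are those adopted later in this section (\alpha = 0.08, \beta = 0.60), and the multiplicative combination of the two factors is an assumption taken from Li et al. [21] rather than stated here.

import math

ALPHA = 0.08  # alpha in Eq. (2), the value adopted later in this section
BETA = 0.60   # beta in Eq. (3), the value adopted later in this section

def f1(path_length):
    # Eq. (2): the similarity contribution decays exponentially with the shortest path length l.
    return math.exp(-ALPHA * path_length)

def f2(subsumer_depth):
    # Eq. (3): tanh-shaped depth factor that approaches 1 as the subsumer depth h grows.
    e_pos = math.exp(BETA * subsumer_depth)
    e_neg = math.exp(-BETA * subsumer_depth)
    return (e_pos - e_neg) / (e_pos + e_neg)

def concept_similarity(path_length, subsumer_depth):
    # Assumption: the two factors are combined multiplicatively, sim(c1, c2) = f1(l) * f2(h),
    # as in Li et al. [21]; this section defines f1 and f2 but does not show the combination.
    return f1(path_length) * f2(subsumer_depth)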
So far we have obtained the semantic similarity between two concepts based on the thesaurus method. However, the common corpus-based (or information-based) method [32] is rather difficult to apply: the required probabilities cannot be obtained from the semantic nets alone, but must be estimated with the help of a large corpus [21]. The Brown Corpus [10] is the first modern, computer-readable general corpus. However, the scope of such a corpus is restricted for the various specific data sets encountered in practice. Moreover, it takes a long time to calculate the probability of encountering an instance of a concept in a large corpus. In the next section we will propose a new corpus-based semantic similarity measure.
Since a word can be expressed by different concepts, the semantic similarity between words is represented by the maximum similarity between the concepts signified by those words. Assuming word w_1 is represented by a concepts (c_{1,1}, c_{1,2}, ..., c_{1,a}) and word w_2 is represented by b concepts (c_{2,1}, c_{2,2}, ..., c_{2,b}), the semantic similarity between these two words is assessed by:
sim(w_1, w_2) = \max\{ sim(c_1, c_2) \}, \quad c_1 \in \{c_{1,1}, c_{1,2}, \ldots, c_{1,a}\}, \; c_2 \in \{c_{2,1}, c_{2,2}, \ldots, c_{2,b}\}   (4)
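A short sketch of (4): given the two concept sets and a concept-level similarity function (for instance concept_similarity above), the word similarity is the maximum over all concept pairs; the handling of words without associated concepts is our assumption.

def word_similarity(concepts_w1, concepts_w2, concept_sim):
    # Eq. (4): word-level similarity is the maximum similarity over all concept
    # pairs (c1, c2), with c1 drawn from w1's concepts and c2 from w2's concepts.
    if not concepts_w1 or not concepts_w2:
        return 0.0  # assumption: a word with no associated concepts contributes zero
    return max(concept_sim(c1, c2)
               for c1 in concepts_w1
               for c2 in concepts_w2)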
Hence, the semantic similarity between these two documents is defined by:
sim_{ONTO}(d_1, d_2) = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} sim(w_{1,i}, w_{2,j}) \right) / (mn)   (5)
where m and n are the numbers of WordNet lexicon words included in documents d_1 and d_2, respectively. In light of the experimental results given by Li et al. [21], \alpha in (2) and \beta in (3) are set to 0.08 and 0.60, respectively.
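A minimal sketch of (5), assuming words_d1 and words_d2 are the lists of WordNet lexicon words found in d_1 and d_2, so that m and n are simply their lengths.

def document_similarity(words_d1, words_d2, word_sim):
    # Eq. (5): average of sim(w_1i, w_2j) over all m*n word pairs of the two documents.
    m, n = len(words_d1), len(words_d2)
    if m == 0 or n == 0:
        return 0.0  # assumption: a document with no lexicon words yields zero similarity
    total = sum(word_sim(w1, w2) for w1 in words_d1 for w2 in words_d2)
    return total / (m * n)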
However, although such a thesaurus-based method is effective and can provide the semantic similarity between two individual words, relying on this semantic similarity alone imposes some restrictions in practice. For example, a document in a specialized domain does not necessarily contain WordNet lexicon words, or after stemming some well-formed words are broken into incomplete forms that are not included in the WordNet lexicon. Hence, some important concepts would be lost, and applying WordNet alone for semantic similarity calculation may not be accurate enough. We therefore combine the thesaurus-based ontology with a new semantic space model (SSM) to calculate the semantic similarity between pairs of documents. In the next section the SSM is proposed to reveal the associated semantic relationships between documents.
3. Semantic similarity calculation based on SSM
In this part we propose and demonstrate a semantic space model (SSM) whose full-dimensional form exactly reproduces the original vector space model under cosine and Euclidean distance similarity calculation, while an appropriately reduced space can capture the true semantic relationships between documents. The SSM is an automatic approach that addresses the above problems by using statistically derived conceptual indices instead of individual words. It utilizes singular value decomposition (SVD) [27,43] to decompose the large term-by-document matrix into a set of k orthogonal factors.
3.1. Proof that SSM simulates VSM
We use the document-by-term matrix D(n \times m) to represent the original corpus matrix, assuming there are m terms in an n-document data set. The transpose of D is then the term-by-document matrix A(m \times n):

D = A^T   (6)
The singular value decomposition of A is defined as

A = U \Sigma V^T   (7)
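To illustrate (6) and (7), a NumPy sketch with a toy corpus matrix follows; the matrix entries and the choice of k are placeholders, and the rank-k truncation anticipates the reduced space used by the SSM.

import numpy as np

# Toy document-by-term matrix D (n = 3 documents, m = 4 terms); the entries are placeholders.
D = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 0.0],
              [2.0, 1.0, 0.0, 1.0]])

A = D.T                                             # Eq. (6): A = D^T, the term-by-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)    # Eq. (7): A = U Sigma V^T

k = 2                                               # number of orthogonal factors retained
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # rank-k approximation of A

# The full decomposition reconstructs A exactly.
assert np.allclose(A, U @ np.diag(s) @ Vt)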