ITNLP-AiKF at SemEval-2017 Task 1: Rich Features Based SVR for
Semantic Textual Similarity Computing
Wenjie Liu, Chengjie Sun, Lei Lin and Bingquan Liu
School of Computer Science and Technology
Harbin Institute of Technology
Harbin, China
{wjliu, cjsun, linl, liubq}@insun.hit.edu.cn
Abstract
Semantic Textual Similarity (STS) measures the degree of semantic equivalence between a pair of sentences. We propose a new system, ITNLP-AiKF, which participates in SemEval 2017 Task 1, Semantic Textual Similarity, Track 5 (English monolingual pairs). Our system involves rich features, including Ontology based, Word Embedding based, Corpus based, Alignment based and Literal based features. We leverage these features to predict sentence pair similarity with a Support Vector Regression (SVR) model. Our system achieves a Pearson correlation of 0.8231, a competitive result in this track.
1 Introduction
The Semantic Evaluation (SemEval) contest is devoted to advancing research on semantic analysis; it attracts many participants and has promoted a number of groundbreaking achievements in the natural language processing (NLP) field. The Semantic Textual Similarity (STS) task addresses the computation of word- and text-level semantic similarity, and has attracted extensive attention from researchers and the NLP community since SemEval 2012 (Agirre et al., 2012).
In STS 2017, the organizers added monolingual Arabic and cross-lingual Arabic-English tracks in order to increase the difficulty of the contest. The task is defined as follows: given two sentences, participating systems are asked to predict a continuous similarity score on a scale from 0 to 5, where 0 indicates that the semantics of the two sentences are completely independent and 5 indicates semantic equivalence. The evaluation criterion is the Pearson correlation coefficient between the gold-standard scores and the system-predicted scores.
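As a minimal illustration of this evaluation criterion (the score values below are invented for the sketch), the Pearson correlation between gold and predicted scores can be computed with NumPy:

```python
import numpy as np

# Hypothetical gold-standard and system-predicted similarity scores (0-5 scale)
gold = np.array([0.0, 1.5, 2.0, 3.5, 5.0])
pred = np.array([0.2, 1.0, 2.5, 3.0, 4.8])

# Pearson correlation coefficient: covariance of the two score vectors
# normalized by the product of their standard deviations
r = np.corrcoef(gold, pred)[0, 1]
print(round(r, 4))  # close to 1.0 when predictions track the gold ranking
```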
In our system, in order to predict the similarity score of two sentences, we trained a Support Vector Regression (SVR) model with abundant features, including Ontology based features, Word Embedding based features, Corpus based features, Alignment based features and Literal based features. All English training, trial and evaluation data sets released by previous STS tasks in SemEval were used to construct our system. Our best system achieved a 0.8231 Pearson correlation coefficient on the SemEval 2017 evaluation data set, while the winner achieved 0.8547.
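A hedged sketch of this setup is shown below. The feature values and gold scores are invented for illustration; the paper's actual feature values come from the five feature families described in Section 2, and its SVR hyperparameters are not specified here:

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical per-pair feature vectors, one row per sentence pair, e.g.
# [ontology_sim, embedding_sim, corpus_sim, alignment_sim, literal_sim]
X_train = np.array([
    [0.9, 0.85, 0.8, 0.9, 0.7],
    [0.4, 0.50, 0.3, 0.4, 0.2],
    [0.1, 0.20, 0.1, 0.0, 0.1],
    [0.7, 0.60, 0.7, 0.8, 0.5],
])
# Gold similarity scores on the task's 0-5 scale
y_train = np.array([4.8, 2.5, 0.5, 3.9])

# Train an RBF-kernel SVR on the feature vectors (hyperparameters are guesses)
model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X_train, y_train)

# Predict for an unseen pair and clip to the 0-5 range required by the task
X_test = np.array([[0.8, 0.75, 0.7, 0.85, 0.6]])
score = float(model.predict(X_test)[0])
score = min(max(score, 0.0), 5.0)
print(round(score, 2))
```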
2 Feature Engineering
In our system, many features were tried to improve performance. Five kinds of features are used: Ontology based features, Word Embedding based features, Corpus based features, Alignment based features and Literal based features. The following is a detailed description of the five kinds of features.
2.1 Ontology Based Features
WordNet (Miller, 1995) is used to exploit On-
tology based features. WordNet is a large
lexical database of English. In WordNet,
nouns, verbs, adjectives and adverbs are di-
vided into sets of cognitive synonyms called
synsets. Each synonym expresses a distinct
concept. WordNet measures two words sim-
ilarity based on Path similarity, Res similarity,
Lin similarity, Wup similarity, Lch similarity and
so on. In our system, we directly use WordNet
APIs provided by NLTK toolkit (Bird, 2006) to
calculate the similarity of two words.
The Path similarity measure is based on the shortest path between two concepts in the WordNet taxonomy. The Path similarity for-