ITNLP-AiKF at SemEval-2017 Task 1: Rich Features Based SVR for
Semantic Textual Similarity Computing
Wenjie Liu, Chengjie Sun, Lei Lin and Bingquan Liu
School of Computer Science and Technology
Harbin Institute of Technology
Harbin, China
{wjliu, cjsun, linl, liubq}@insun.hit.edu.cn
Abstract
Semantic Textual Similarity (STS) measures the degree of semantic equivalence between a pair of sentences. We propose a new system, ITNLP-AiKF, which participates in SemEval 2017 Task 1, Semantic Textual Similarity, Track 5 (English monolingual pairs). Our system involves rich features, including Ontology based, Word Embedding based, Corpus based, Alignment based and Literal based features. We leverage these features to predict sentence pair similarity with a Support Vector Regression (SVR) model. Our system achieves a Pearson correlation of 0.8231, a competitive result in this track.
1 Introduction
The Semantic Evaluation (SemEval) contest is devoted to advancing research on semantic analysis; it attracts many participants and has promoted a number of groundbreaking achievements in the natural language processing (NLP) field. The Semantic Textual Similarity (STS) task addresses the computation of word- and text-level semantic similarity, and has attracted extensive attention from researchers and the NLP community since SemEval 2012 (Agirre et al., 2012).
In STS 2017, the organizers added monolingual Arabic and cross-lingual Arabic-English tracks in order to increase the difficulty of the contest. The task is defined as follows: given two sentences, participating systems are asked to predict a continuous similarity score on a scale from 0 to 5, where 0 indicates that the semantics of the two sentences are completely independent and 5 indicates semantic equivalence. The evaluation criterion is the Pearson correlation coefficient between the gold-standard scores and the system-predicted scores.
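As a minimal illustration of this evaluation criterion (the score values below are invented for the sketch), the Pearson correlation between gold and predicted scores can be computed with NumPy:

```python
import numpy as np

# Hypothetical gold-standard and system-predicted similarity scores (0-5 scale)
gold = np.array([0.0, 1.5, 2.0, 3.5, 5.0])
pred = np.array([0.2, 1.0, 2.5, 3.0, 4.8])

# Pearson correlation coefficient: covariance of the two score vectors
# normalized by the product of their standard deviations
r = np.corrcoef(gold, pred)[0, 1]
print(round(r, 4))  # close to 1.0 when predictions track the gold ranking
```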
In our system, in order to predict the similarity score of two sentences, we trained a Support Vector Regression (SVR) model with abundant features, including Ontology based features, Word Embedding based features, Corpus based features, Alignment based features and Literal based features. All English training, trial and evaluation data sets released by previous STS tasks in SemEval were used to construct our system. Our best system achieved a 0.8231 Pearson correlation coefficient on the SemEval 2017 evaluation data set, while the winner achieved 0.8547.
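A hedged sketch of this setup is shown below. The feature values and gold scores are invented for illustration; the paper's actual feature values come from the five feature families described in Section 2, and its SVR hyperparameters are not specified here:

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical per-pair feature vectors, one row per sentence pair, e.g.
# [ontology_sim, embedding_sim, corpus_sim, alignment_sim, literal_sim]
X_train = np.array([
    [0.9, 0.85, 0.8, 0.9, 0.7],
    [0.4, 0.50, 0.3, 0.4, 0.2],
    [0.1, 0.20, 0.1, 0.0, 0.1],
    [0.7, 0.60, 0.7, 0.8, 0.5],
])
# Gold similarity scores on the task's 0-5 scale
y_train = np.array([4.8, 2.5, 0.5, 3.9])

# Train an RBF-kernel SVR on the feature vectors (hyperparameters are guesses)
model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X_train, y_train)

# Predict for an unseen pair and clip to the 0-5 range required by the task
X_test = np.array([[0.8, 0.75, 0.7, 0.85, 0.6]])
score = float(model.predict(X_test)[0])
score = min(max(score, 0.0), 5.0)
print(round(score, 2))
```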
2 Feature Engineering
In our system, many features were tried to improve performance. Five kinds of features are used: Ontology based features, Word Embedding based features, Corpus based features, Alignment based features and Literal based features. The following is a detailed description of the five kinds of features.
2.1 Ontology Based Features
WordNet (Miller, 1995) is used to exploit On-
tology based features. WordNet is a large
lexical database of English. In WordNet,
nouns, verbs, adjectives and adverbs are di-
vided into sets of cognitive synonyms called
synsets. Each synonym expresses a distinct
concept. WordNet measures two words sim-
ilarity based on Path similarity, Res similarity,
Lin similarity, Wup similarity, Lch similarity and
so on. In our system, we directly use WordNet
APIs provided by NLTK toolkit (Bird, 2006) to
calculate the similarity of two words.
The Path similarity measure is based on the shortest path between two concepts in the WordNet taxonomy. The Path similarity for-