chairwoman, chair, chairperson} expresses a job title ‘‘the
officer who presides at the meetings of an organization’’.
There exist various kinds of semantic relations between
concepts in WordNet, such as hyponym/hypernym (is-a),
meronymy/holonymy (part-of, member-of, substance-of),
and antonymy. The inherited ‘‘is-a’’ relation accounts for
nearly 80% of all relationship types. Consequently, we
employ the ‘‘is-a’’ relationship in this work to augment the
semantic information of a given word.
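As an illustrative sketch (not part of the original study), the ‘‘is-a’’ chains of a word can be inspected with NLTK's WordNet interface; the query word ‘‘chairman’’ is chosen only as an example:

from nltk.corpus import wordnet as wn

# List every synset of "chairman" and its direct hypernyms,
# i.e., the "is-a" relation employed in this work.
for synset in wn.synsets('chairman'):
    print(synset.name(), '-', synset.definition())
    for hypernym in synset.hypernyms():
        print('  is-a:', hypernym.name())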
In terms of the semantic properties used for similarity computation, WordNet-based measures roughly fall into four categories: distance-based, information content-based, feature-based, and hybrid. Distance-based measures [17] evaluate the semantic similarity of concepts by means of different structural properties, such as the path distance between two concepts based on edge length,
depth, and density. Measures based on information content (IC) evaluate how specific and informative a concept is from the perspective of information theory [35], where a higher IC value is assigned to a more concrete concept [18]. In IC-based measures, either the frequency
counts of words in synsets are derived from additional
corpora or the intrinsic hierarchical structure of WordNet is
used to model the IC of concepts. Feature-based measures employ the intrinsic attribute information in WordNet, such as synsets, glosses, and taxonomic relations, to construct feature sets or vectors. Patwardhan et al. proposed two similarity measures based on gloss overlaps [3] and the cosine similarity between gloss vectors [31], respectively. For semantic similarity measurement, Liu et al. [19]
took local densities as the intrinsic properties of concepts
for constructing concept vectors. Hybrid measures [12, 33] commonly take advantage of different computing methods by combining path distance, IC of concepts, and features of concepts.
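To make these categories concrete, the following sketch uses NLTK's off-the-shelf WordNet measures (purely illustrative; the cited works define their own formulations): path_similarity is a distance-based measure over the ‘‘is-a’’ hierarchy, while res_similarity is Resnik's IC-based measure with frequency counts taken from the Brown corpus.

from nltk.corpus import wordnet as wn, wordnet_ic

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

# Distance-based: similarity derived from the shortest is-a path
# between the two concepts.
print(dog.path_similarity(cat))

# IC-based (Resnik): information content of the least common subsumer,
# estimated from Brown corpus frequency counts.
brown_ic = wordnet_ic.ic('ic-brown.dat')
print(dog.res_similarity(cat, brown_ic))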
2.2 Distributed vector representation
In corpus-based measures, lexical vectors are used for
estimating semantic similarity between words. As an
alternative to the traditional distributional vector [40], the distributed vector representation (namely, word embedding) derived from deep learning techniques has significantly improved semantic similarity evaluation, semantic disambiguation [15], and analogy relationship reasoning [27]. In this distributed vector space, the semantic and
syntactic information [26] as well as morphology [20] are
implicitly encoded into low-dimensional continuous vec-
tors by unsupervised neural network learning.
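In such a space, the similarity of two words is typically scored as the cosine of the angle between their embeddings; a minimal sketch with toy vectors (the values below are placeholders, not learned embeddings):

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy low-dimensional vectors standing in for learned word embeddings.
v_king = np.array([0.8, 0.1, 0.3])
v_queen = np.array([0.7, 0.2, 0.4])
print(cosine_similarity(v_king, v_queen))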
In terms of neural network models, promising distributed representations include vectors from recurrent neural networks [24], recursive neural networks [38], context-aware models [15], and log-linear models [25]. Two log-linear models, i.e., CBOW and Skip-gram proposed by Mikolov et al. [25], reduce the training complexity caused by the nonlinear hidden layer in other models. CBOW leverages the sum of the continuous bag-of-words context vectors to learn a target word representation, while the training objective of Skip-gram is to predict the representations of the context words given a target word. CBOW is relatively faster than the Skip-gram model; however, the latter is more discriminative for rare words. For semantic disambiguation, Huang et al.
employed both local and global context to generate mul-
tiple prototypes of word embedding when measuring
semantic similarity [15]. Chen et al. leveraged the concept
paraphrases in WordNet to produce multiple sense vectors
for each word [7].
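A minimal sketch of training both models with the gensim library (an assumption for illustration; parameter names follow gensim 4.x, and the toy corpus is far too small to yield meaningful vectors): sg=0 selects CBOW and sg=1 selects Skip-gram.

from gensim.models import Word2Vec

sentences = [
    ['the', 'chairman', 'presides', 'at', 'the', 'meeting'],
    ['the', 'chairperson', 'opens', 'the', 'meeting'],
]

# sg=0: CBOW predicts the target word from its summed context;
# sg=1: Skip-gram predicts the context words from the target word.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv.similarity('chairman', 'chairperson'))
print(skipgram.wv.similarity('chairman', 'chairperson'))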
2.3 Semantic fusion on different levels
A structured ontology is considered more effective than a corpus, which may suffer from sparseness and imbalance of semantic information [1]. Hence, a number of studies
focus on incorporating the semantic information from
ontology into the corpus-based measures. The relevant
works conduct semantic fusion from different aspects. We
define their classifications as vector-level [4, 7], metric-
level [1, 2, 6, 42], and model-level [10, 41, 43] according
to the increasing granularity of semantic fusion between
corpus and ontology.
Vector-based methods directly fuse the semantic information from the ontology into the corpus through vector operations or vector extension. Bian et al. extended the original 1-of-v word vector with additional features extracted from WordNet, such as concept and part of speech [4]. To combine the semantic features from WordNet and the corpus, Chen et al. replaced the distributed vector of a word with the averaged vector of the words in its gloss whose cosine similarities with the target word are larger than a threshold [7].
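One simplified reading of the Chen et al. scheme [7] can be sketched as follows (an interpretation, not the authors' implementation; the threshold value is a free parameter, not one reported in [7]):

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def gloss_averaged_vector(word, gloss_words, embeddings, threshold=0.3):
    # Replace the word's vector with the average of the embeddings of
    # gloss words whose cosine similarity to the word exceeds the threshold.
    target = embeddings[word]
    selected = [embeddings[g] for g in gloss_words
                if g in embeddings and cosine(embeddings[g], target) > threshold]
    # Fall back to the original vector if no gloss word qualifies.
    return np.mean(selected, axis=0) if selected else target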
Metric-based methods mainly combine various semantic similarity measures based on unsupervised or supervised learning. Agirre [1] verified that the supervised combination of multiple methods can produce better results by implementing 10-fold cross-validation on a ranking classification task. Alves et al. proposed a regression function that takes lexical similarity, syntactic similarity, semantic similarity, and distributional similarity as input factors [2]. Chaves-González and Martínez-Gil used an evolutionary algorithm to optimize the unsupervised combination of various WordNet-based similarity metrics [6]. Yih and Qazvinian averaged the similarity results derived from heterogeneous vector space models on Wikipedia, web search, thesaurus, and WordNet, respectively [42].
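In the spirit of such metric-level fusion, a supervised combiner can be sketched as a simple regression over the scores of individual measures (an illustration with placeholder numbers, not data from the cited studies):

import numpy as np
from sklearn.linear_model import LinearRegression

# Each row holds the scores of several individual similarity measures
# (e.g., distance-based, IC-based, distributional) for one word pair;
# y holds gold-standard human similarity ratings.
X = np.array([[0.8, 0.7, 0.9],
              [0.2, 0.3, 0.1],
              [0.5, 0.6, 0.4]])
y = np.array([0.85, 0.15, 0.55])

combiner = LinearRegression().fit(X, y)
print(combiner.predict(np.array([[0.6, 0.5, 0.7]])))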