A Comparative Study of Text Similarity in Vector Space Models


"这篇研究论文‘Text Similarity in Vector Space Models: A Comparative Study’探讨了在自然语言处理中自动衡量语义文本相似性的重要性。作者评估了不同向量空间模型在执行此任务时的表现，包括TF-IDF及其扩展、主题模型（如潜在语义索引）和神经网络模型（如段落向量）。实验集中在专利与专利之间的相似性建模，并对比了各种方法的性能。" 在文本相似度计算中，向量空间模型是关键工具。这些模型将文本转化为数学向量，使得我们可以量化和比较文本间的相似程度。TF-IDF（词频-逆文档频率）是一种经典的向量表示方法，它通过结合单词在文档中的出现频率和在整个文集中的普遍性来创建向量。TF-IDF的优势在于它可以过滤掉常见但不具区分性的词汇，突出具有文档特异性的词汇。本研究比较了TF-IDF及其变体，例如可能的扩展，这些扩展试图改进TF-IDF的基本框架。此外，还考虑了主题模型，如潜在语义索引（LSI），它通过降维技术捕捉文本中的隐含主题结构。LSI和其他主题模型可以捕获单词之间的上下文关系，但计算成本较高。另一类模型是神经网络模型，尤其是段落向量（如Doc2Vec），这些模型能学习到更丰富的上下文信息，生成更复杂的向量表示。这些模型在处理短文本和简单相似度比较时，其优势更为明显，因为它们能捕获到词汇的语义关系。然而，实验结果出乎意料，对于更长、更技术性的文本或需要精细区分最近邻的场景，TF-IDF表现得相当出色。这表明，在某些情况下，TF-IDF的效率和简单性可能优于更复杂的方法，尽管这些复杂方法通常有更高的计算需求。该研究强调了在选择文本相似度计算方法时应考虑的具体场景和目标，以及不同模型在处理不同类型的文本数据时的适用性。对于实际应用，如专利检索和分析，理解这些模型的优缺点至关重要，以便选择最有效的方法来解决特定问题。

(e.g. computing k nearest neighbors [14]), it ignores n-gram phrases, and all IDF weights might need to be updated upon the addition of new documents. The basic model, however, can be extended in several ways to avoid some of these pitfalls. We consider two recently proposed extensions in this study.

First, we consider adding certain n-grams to the term vocabulary. N-grams allow for the combination of terms into higher-level concepts, which may be particularly important for research in computational social sciences, including patent research [2]. Adding n-grams blindly, however, would vastly increase the size of the vocabulary, and thus the number of vector dimensions. A more manageable approach, therefore, is to add noun phrases based on syntactic properties of the text. We test the phrase extraction technique from [9], which extracts noun phrases with a pattern-based method. The authors extend the simple noun phrase grammar of formula 2 to support better coordination of noun phrases and better handling of textual tags. A finite state transducer is used to extract the text portions that match the input grammar, including nested and overlapping parts, from input text annotated with part-of-speech (POS) tags. They impose no upper bound on the size of extracted phrases and show that their method extracts high-quality noun phrases efficiently.

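$$\text{Noun Phrase} \simeq (\text{Adj.} \mid \text{Noun})^{*}\ \text{Noun}\ \bigl(\text{Prep.}\ \text{Det.}^{*}\ (\text{Adj.} \mid \text{Noun})^{*}\ \text{Noun}\bigr)^{*} \tag{2}$$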
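As a rough illustration of how the grammar in formula 2 can be matched against POS-tagged text, the sketch below encodes each tag as a single letter and applies an ordinary regular expression. This is a simplification and not the method of [9]: a regular expression returns only the longest leftmost, non-overlapping matches, whereas the finite state transducer described above also yields nested and overlapping phrases. The tag names, the `TAG_TO_SYMBOL` mapping, and the function `extract_noun_phrases` are hypothetical.

```python
import re

# Hypothetical mapping from coarse POS tags to one-letter grammar symbols.
TAG_TO_SYMBOL = {"ADJ": "A", "NOUN": "N", "ADP": "P", "DET": "D"}

# Formula 2 over one-letter symbols: (A|N)* N (P D* (A|N)* N)*
NOUN_PHRASE = re.compile(r"[AN]*N(?:PD*[AN]*N)*")


def extract_noun_phrases(tagged_tokens):
    """Return the longest non-overlapping noun phrases in a POS-tagged sentence."""
    # Encode the tag sequence as a string; "O" marks any tag outside the grammar.
    symbols = "".join(TAG_TO_SYMBOL.get(tag, "O") for _, tag in tagged_tokens)
    phrases = []
    for match in NOUN_PHRASE.finditer(symbols):
        # One symbol per token, so match offsets index directly into the tokens.
        words = [word for word, _ in tagged_tokens[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases


# Hand-tagged example sentence (tags assigned manually for illustration).
sentence = [
    ("A", "DET"), ("finite", "ADJ"), ("state", "NOUN"), ("transducer", "NOUN"),
    ("extracts", "VERB"), ("portions", "NOUN"), ("of", "ADP"), ("the", "DET"),
    ("input", "NOUN"), ("text", "NOUN"),
]
phrases = extract_noun_phrases(sentence)
# phrases == ["finite state transducer", "portions of the input text"]
```

Because the pattern places no limit on repetitions, phrases of any length can match, which mirrors the absence of an upper bound on phrase size noted above.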

A second, and separate, extension takes advantage of the timing information of patents to implement incremental IDF [10]. More specifically, whenever a new document is added to the corpus, the corresponding IDF at that point in time is calculated based on the current state of the total corpus (see formula 3, where $T$ and $D_T$ are the addition time of a new document to the corpus and the corpus available at time $T$, respectively). Therefore, a term would have a high IDF, and thus high differentiating power, when it is first introduced into the vocabulary, and the IDF would attenuate over time as use of the term became more common. An example would be a niche term for an emerging technology, where the term would have a very high importance at the time of filing the patent, but would decline in importance over time. As a convenient side property, incremental calculation of IDFs also avoids the need to update all TFIDF vectors upon addition of a new document to the corpus.

$$\mathrm{TFIDF}_{t,d,T} = \mathrm{TF}_{t,d} \cdot \log \frac{|D_T| + 1}{\mathrm{DF}_{t,D_T} + 1} \tag{3}$$
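The incremental scheme of formula 3 can be sketched as a single pass over documents in order of arrival. The sketch assumes the IDF term is `log((|D_T| + 1) / (df + 1))`, where `df` is the number of documents in the corpus at time T that contain the term, and that the arriving document counts as part of that corpus. Both are assumptions, as are the function name `incremental_tfidf` and the toy data.

```python
import math
from collections import Counter


def incremental_tfidf(docs_in_time_order):
    """Weight each document with IDFs computed from the corpus available at its arrival."""
    df = Counter()  # document frequency of each term so far
    n_docs = 0      # size of the corpus so far
    vectors = []
    for doc in docs_in_time_order:
        # Assumption: the arriving document is part of the corpus at time T.
        n_docs += 1
        df.update(set(doc))
        tf = Counter(doc)
        vectors.append(
            {t: tf[t] * math.log((n_docs + 1) / (df[t] + 1)) for t in tf}
        )
        # Earlier vectors are left untouched: no corpus-wide recomputation.
    return vectors


# Toy corpus in filing order (invented for illustration).
docs = [
    ["engine", "valve"],
    ["engine", "piston"],
    ["blockchain", "ledger", "engine"],
    ["blockchain", "engine"],
    ["blockchain", "ledger"],
]
vectors = incremental_tfidf(docs)
weights = [v["blockchain"] for v in vectors[2:]]
```

Under these assumptions the weight of the hypothetical niche term "blockchain" is highest in the first document that uses it and decreases in each later document, which is the attenuation described above.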

Topic Models. Topic models transform a text into a fixed-size vector whose length equals a given number of latent topics. The vector represents the probability distribution that the focal text relates to each of the different topics. In practice, each topic is a weighted average of a subset of terms. Similar to TFIDF, topic models treat the text as a bag of words in which word order is ignored. On the downside, interpretation of each topic can be subjective, and determining the right number of topics requires tuning of the model.
