
From Word Embeddings To Document Distances
Matt J. Kusner MKUSNER@WUSTL.EDU
Yu Sun YUSUN@WUSTL.EDU
Nicholas I. Kolkin N.KOLKIN@WUSTL.EDU
Kilian Q. Weinberger KILIAN@WUSTL.EDU
Washington University in St. Louis, 1 Brookings Dr., St. Louis, MO 63130
Abstract
We present the Word Mover's Distance (WMD), a novel distance function between text documents. Our work is based on recent results in word embeddings that learn semantically meaningful representations for words from local co-occurrences in sentences. The WMD measures the dissimilarity between two text documents as the minimum amount of distance that the embedded words of one document need to "travel" to reach the embedded words of another document. We show that this distance metric can be cast as an instance of the Earth Mover's Distance, a well-studied transportation problem for which several highly efficient solvers have been developed. Our metric has no hyperparameters and is straightforward to implement. Further, we demonstrate on eight real-world document classification data sets, in comparison with seven state-of-the-art baselines, that the WMD metric leads to unprecedentedly low k-nearest-neighbor document classification error rates.
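As an illustrative sketch of this computation (not the authors' implementation), the WMD can be evaluated with any off-the-shelf Earth Mover's Distance solver. The example below assumes the third-party POT package (import ot) and uses random placeholder vectors in place of real word2vec embeddings; the uniform weights play the role of normalized bag-of-words counts.

import numpy as np
import ot  # POT: Python Optimal Transport (assumed installed, e.g. via pip install pot)

def wmd(X1, w1, X2, w2):
    # Word Mover's Distance as an Earth Mover's Distance: the minimum
    # cumulative cost of moving the word mass of document 1 onto the
    # word mass of document 2, with Euclidean ground costs between
    # embedded words.
    M = ot.dist(X1, X2, metric="euclidean")  # pairwise word-to-word costs
    return ot.emd2(w1, w2, M)                # exact transportation problem

# Hypothetical 300-dimensional embeddings for the non-stop words of two
# short documents (random placeholders, for illustration only).
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(4, 300)), rng.normal(size=(4, 300))
w1 = w2 = np.full(4, 1.0 / 4)  # uniform normalized word weights
print(wmd(X1, w1, X2, w2))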
1. Introduction
Accurately representing the distance between two documents has far-reaching applications in document retrieval (Salton & Buckley, 1988), news categorization and clustering (Ontrup & Ritter, 2001; Greene & Cunningham, 2006), song identification (Brochu & Freitas, 2002), and multilingual document matching (Quadrianto et al., 2009).
Figure 1. An illustration of the word mover's distance. All non-stop words (bold) of both documents are embedded into a word2vec space. The distance between the two documents is the minimum cumulative distance that all words in document 1 need to travel to exactly match document 2. (Best viewed in color.)

The two most common ways documents are represented are via a bag of words (BOW) or their term frequency-inverse document frequency (TF-IDF). However, these features are often not suitable for document distances due to
their frequent near-orthogonality (Schölkopf et al., 2002; Greene & Cunningham, 2006). Another significant drawback of these representations is that they do not capture the distance between individual words. Take for example the two sentences in different documents: "Obama speaks to the media in Illinois" and "The President greets the press in Chicago". While these sentences have no words in common, they convey nearly the same information, a fact that cannot be represented by the BOW model. In this case, the closeness of the word pairs (Obama, President), (speaks, greets), (media, press), and (Illinois, Chicago) is not factored into the BOW-based distance.
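To make this concrete, the short sketch below (using scikit-learn's CountVectorizer as one convenient choice, not part of the original work) builds bag-of-words vectors for the two example sentences; because they share no non-stop words, their cosine similarity is exactly zero even though their meanings nearly coincide.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Obama speaks to the media in Illinois",
        "The President greets the press in Chicago"]

# Bag-of-words counts with English stop words removed
bow = CountVectorizer(stop_words="english").fit_transform(docs).toarray()

# The sentences share no non-stop words, so the BOW vectors are orthogonal.
cosine = bow[0] @ bow[1] / (np.linalg.norm(bow[0]) * np.linalg.norm(bow[1]))
print(cosine)  # 0.0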
There have been numerous methods that attempt to circumvent this problem by learning a latent low-dimensional representation of documents. Latent Semantic Indexing (LSI) (Deerwester et al., 1990) eigendecomposes the BOW feature space, and Latent Dirichlet Allocation (LDA) (Blei et al., 2003) probabilistically groups similar words into topics and represents documents as distributions over these topics. At the same time, there are many competing variants of BOW/TF-IDF (Salton & Buckley, 1988; Robertson & Walker, 1994). While these approaches produce a more coherent document representation than BOW, they often do not improve the empirical performance of BOW on distance-based tasks (e.g., nearest-neighbor classifiers) (Petterson et al., 2010; Mikolov et al., 2013c).
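As a rough illustration of the latent-representation idea (a simplified sketch, not the exact setups of the cited works), LSI can be computed as a truncated SVD of a TF-IDF term-document matrix, mapping each document to a low-dimensional latent vector:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = ["Obama speaks to the media in Illinois",
          "The President greets the press in Chicago",
          "The band gave a concert in Japan"]

# LSI: project TF-IDF features onto their leading singular directions,
# yielding one low-dimensional latent vector per document.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
latent = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
print(latent.shape)  # (3, 2)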