Analogies Explained: Towards Understanding Word Embeddings
Carl Allen^1   Timothy Hospedales^1
Abstract

Word embeddings generated by neural network methods such as word2vec (W2V) are well known to exhibit seemingly linear behaviour, e.g. the embeddings of analogy "woman is to queen as man is to king" approximately describe a parallelogram. This property is particularly intriguing since the embeddings are not trained to achieve it. Several explanations have been proposed, but each introduces assumptions that do not hold in practice. We derive a probabilistically grounded definition of paraphrasing that we re-interpret as word transformation, a mathematical description of "$w_x$ is to $w_y$". From these concepts we prove existence of linear relationships between W2V-type embeddings that underlie the analogical phenomenon, identifying explicit error terms.
1. Introduction
The vector representation, or embedding, of words underpins much of modern machine learning for natural language processing (e.g. Turney & Pantel (2010)). Where, previously, embeddings were generated explicitly from word statistics, neural network methods are now commonly used to generate neural embeddings that are of low dimension relative to the number of words represented, yet achieve impressive performance on downstream tasks (e.g. Turian et al. (2010); Socher et al. (2013)). Of these, word2vec^2 (W2V) (Mikolov et al., 2013a) and Glove (Pennington et al., 2014) are amongst the best known, and are those on which we focus.
Interestingly, such embeddings exhibit seemingly linear behaviour (Mikolov et al., 2013b; Levy & Goldberg, 2014a), e.g. the respective embeddings of analogies, or word relationships of the form "$w_a$ is to $w_{a^*}$ as $w_b$ is to $w_{b^*}$", often satisfy $w_{a^*} - w_a + w_b \approx w_{b^*}$, where $w_i$ is the embedding of word $w_i$. This enables analogical questions such as "man is to king as woman is to ..?" to be solved by vector addition and subtraction. Such high order structure is surprising since word embeddings are trained using only pairwise word co-occurrence data extracted from a text corpus.

^1 School of Informatics, University of Edinburgh. Correspondence to: Carl Allen <carl.allen@ed.ac.uk>.

^2 Throughout, we refer to the more commonly used Skipgram implementation of W2V with negative sampling (SGNS).

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).
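For illustration, a minimal sketch of answering such a question by vector arithmetic: the nearest embedding (by cosine similarity) to $w_{king} - w_{man} + w_{woman}$ is taken as the answer. The tiny hand-made vectors below are invented purely for the example and are not trained W2V embeddings.

```python
import numpy as np

# Toy, hand-made embeddings for illustration only (not trained vectors),
# constructed so that king - man + woman lands on queen.
emb = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
}

def solve_analogy(a, a_star, b, emb):
    """Return the word maximising cosine similarity to w_{a*} - w_a + w_b,
    excluding the three query words (the standard evaluation convention)."""
    target = emb[a_star] - emb[a] + emb[b]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = {w: v for w, v in emb.items() if w not in {a, a_star, b}}
    return max(candidates, key=lambda w: cos(candidates[w], target))

# "man is to king as woman is to ..?"
print(solve_analogy("man", "king", "woman", emb))  # -> queen
```

Excluding the query words themselves is the usual convention when evaluating analogies, since the unmodified vector $w_{a^*}$ is often the numerical nearest neighbour of the target.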
We first show that where embeddings factorise pointwise mutual information (PMI), it is paraphrasing that determines when a linear combination of embeddings equates to that of another word. We say king paraphrases man and royal, for example, if there is a semantic equivalence between king and {man, royal} combined. We can measure such equivalence with respect to probability distributions over nearby words, in line with Firth's maxim "You shall know a word by the company it keeps" (Firth, 1957). We then show that paraphrasing can be reinterpreted as word transformation with additive parameters (e.g. from man to king by adding royal) and generalise to also allow subtraction. Finally, we prove that by interpreting an analogy "$w_a$ is to $w_{a^*}$ as $w_b$ is to $w_{b^*}$" as word transformations $w_a$ to $w_{a^*}$ and $w_b$ to $w_{b^*}$ sharing the same parameters, the linear relationship observed between word embeddings of analogies follows (see overview in Fig 4). Our key contributions are:
• to derive a probabilistic definition of paraphrasing and show that it governs the relationship between one (PMI-derived) word embedding and any sum of others;

• to show how paraphrasing can be generalised and interpreted as the transformation from one word to another, giving a mathematical formulation for "$w_x$ is to $w_{x^*}$";

• to provide the first rigorous proof of the linear relationship between word embeddings of analogies, including explicit, interpretable error terms; and

• to show how these relationships materialise between vectors of PMI values, and so too in word embeddings that factorise the PMI matrix, or approximate such a factorisation, e.g. W2V and Glove.
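As a toy illustration of the last point, the sketch below builds a PMI matrix from invented co-occurrence counts and factorises it with a truncated SVD, so that dot products of the resulting vectors approximate PMI values. The counts, the dimension $d$, and the use of SVD are illustrative assumptions for this sketch, not the W2V training procedure.

```python
import numpy as np

# Invented symmetric word-word co-occurrence counts, for illustration only.
counts = np.array([[0., 4., 2.],
                   [4., 0., 3.],
                   [2., 3., 0.]])

total = counts.sum()
p_ij = counts / total                        # joint probabilities p(i, j)
p_i = counts.sum(axis=1) / total             # marginal probabilities p(i)
with np.errstate(divide="ignore"):
    pmi = np.log(p_ij / np.outer(p_i, p_i))  # PMI(i,j) = log p(i,j)/(p(i)p(j))
pmi[np.isinf(pmi)] = 0.0                     # zero out undefined (log 0) entries

# Rank-d factorisation: PMI ≈ W @ C.T, giving word (W) and context (C) vectors
U, S, Vt = np.linalg.svd(pmi)
d = 2
W = U[:, :d] * np.sqrt(S[:d])
C = Vt[:d].T * np.sqrt(S[:d])

err = np.abs(W @ C.T - pmi).max()
print(f"max reconstruction error: {err:.3f}")
```

The point of the sketch is only that embedding dot products can recover PMI values up to a low-rank approximation error; the paper's results apply to embeddings that factorise, or approximately factorise, the PMI matrix in this sense.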
2. Previous Work
Intuition for the presence of linear analogical relationships,
or linguistic regularity, amongst word embeddings was first
suggested by Mikolov et al. (2013a;b) and Pennington et al.
(2014), and has been widely discussed since (e.g. Levy &
Goldberg (2014a); Linzen (2016)). More recently, several
theoretical explanations have been proposed: