Understanding word2vec's Negative Sampling Method: The Formula Explained, with an Application Guide


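The article "[W2V] Negative-Sampling Word-Embedding Method.pdf" discusses the negative sampling method used in the word-embedding technique word2vec, specifically in the skip-gram model. Skip-gram is a widely used model for learning word embeddings that capture semantic and contextual relationships between words. At its core, the model computes the probability that a context c appears given a word w. Negative sampling is a key step in word2vec: it sidesteps the computationally expensive softmax, which makes training on large datasets feasible. The softmax has to normalize over every word in the vocabulary, which is very costly when the vocabulary is large. Negative sampling works as follows:

1. **Target distribution**: Equation (4) of the paper being explained involves the conditional distribution P(c|w), which is defined over the entire vocabulary. In practice only the positive sample c is scored explicitly; the rest of the vocabulary is represented by a handful of negative samples.
2. **Negative sample generation**: To reduce the computational cost, k words are drawn at random to serve as negative samples. They are drawn from a noise distribution based on word frequency (in word2vec, the unigram distribution raised to the 3/4 power), so a sampled word is unlikely to be a true context of w.
3. **Probability estimation**: For each positive sample c, k negative samples are generated. The model scores how likely c is to be an observed context of w, and how likely each of the k negative samples is to be one. These k + 1 scores serve as a cheap substitute for evaluating the full conditional probability P(c|w).
4. **Loss function**: The objective contrasts the scores of the positive and negative samples, and the word embeddings are updated by optimizing it. It takes the form of a log-likelihood: the log-sigmoid of the positive pair's score plus, for each negative sample, the log-sigmoid of the negated score. Observed pairs are thereby pushed toward high scores and sampled pairs toward low scores.
5. **Efficiency**: Negative sampling reduces the work per training pair from one term per vocabulary word to k + 1 terms, which makes large-scale training feasible. The resulting embeddings remain useful distributed representations, in which similar words lie close together in a low-dimensional space.

In summary, "[W2V] Negative-Sampling Word-Embedding Method.pdf" explains in depth how negative sampling is used to train word embeddings efficiently in the skip-gram model, a key ingredient in the success of word2vec. Understanding this process helps researchers and developers make better use of word2vec for text analysis and natural language processing tasks.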
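The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the word2vec implementation: the vocabulary size, the vector dimension, the word counts, and the helper names (`noise_distribution`, `negative_sampling_loss`) are all assumptions made for the example, and the score of a pair is taken to be the dot product of its two vectors.

```python
import math
import random

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, renormalized to sum to 1."""
    weights = [count ** power for count in counts]
    total = sum(weights)
    return [weight / total for weight in weights]

def negative_sampling_loss(v_w, v_c, v_negs):
    """Negative log-likelihood for one (w, c) pair and its negative samples."""
    positive = log_sigmoid(dot(v_c, v_w))                           # observed pair
    negative = sum(log_sigmoid(-dot(v_n, v_w)) for v_n in v_negs)   # sampled pairs
    return -(positive + negative)

# Toy setup (all values are illustrative assumptions).
rng = random.Random(0)
vocab_size, dim, k = 10, 8, 5
word_vecs = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(vocab_size)]
context_vecs = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(vocab_size)]
counts = list(range(1, vocab_size + 1))   # made-up word frequencies
noise = noise_distribution(counts)

w, c = 2, 7                               # one positive (word, context) pair
neg_ids = rng.choices(range(vocab_size), weights=noise, k=k)
loss = negative_sampling_loss(
    word_vecs[w], context_vecs[c], [context_vecs[i] for i in neg_ids]
)
print(f"negatives={neg_ids} loss={loss:.4f}")
```

Each call touches only k + 1 context vectors, regardless of the vocabulary size, which is where the saving over a full softmax comes from.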

arXiv:1402.3722v1 [cs.CL] 15 Feb 2014

word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method

Yoav Goldberg and Omer Levy

{yoav.goldberg,omerlevy}@gmail.com

February 14, 2014

The word2vec software of Tomas Mikolov and colleagues has gained a lot of traction lately, and provides state-of-the-art word embeddings. The learning models behind the software are described in two research papers [1, 2]. We found the description of the models in these papers to be somewhat cryptic and hard to follow. While the motivations and presentation may be obvious to the neural-networks language-modeling crowd, we had to struggle quite a bit to figure out the rationale behind the equations.

This note is an attempt to explain equation (4) (negative sampling) in “Distributed Representations of Words and Phrases and their Compositionality” by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean [2].

1 The skip-gram model

The departure point of the paper is the skip-gram model. In this model we are given a corpus of words w and their contexts c. We consider the conditional probabilities p(c|w), and given a corpus Text, the goal is to set the parameters θ of p(c|w; θ) so as to maximize the corpus probability:

$$\arg\max_{\theta} \prod_{w \in Text} \left[ \prod_{c \in C(w)} p(c \mid w; \theta) \right] \tag{1}$$

In this equation, C(w) is the set of contexts of word w. Alternatively:

$$\arg\max_{\theta} \prod_{(w,c) \in D} p(c \mid w; \theta) \tag{2}$$

Here D is the set of all word and context pairs we extract from the text.
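To make the two formulations concrete, the sketch below builds C(w) for each word position and the pair collection D from a toy sentence, then checks that the two groupings give the same corpus log-probability. Several things here are assumptions for illustration only: contexts are taken to be the words within a symmetric window around each position, D is kept as a list so that repeated pairs count once per occurrence, and `toy_log_prob` is a hypothetical stand-in for log p(c|w; θ) that assigns a uniform probability.

```python
import math

def contexts_by_position(tokens, window=2):
    """Return one (word, contexts) entry per token position.

    Assumes contexts are the words within `window` positions of the word.
    """
    result = []
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        contexts = [tokens[j] for j in range(lo, hi) if j != i]
        result.append((word, contexts))
    return result

def toy_log_prob(context, word, vocab_size):
    """Hypothetical stand-in for log p(c|w; theta): uniform over the vocabulary."""
    return -math.log(vocab_size)

tokens = "the quick brown fox jumps".split()
per_word = contexts_by_position(tokens, window=1)

# D: every (word, context) pair extracted from the text.
D = [(word, c) for word, contexts in per_word for c in contexts]

vocab_size = len(set(tokens))

# Log of the product in (1): outer sum over words, inner sum over C(w).
ll_eq1 = sum(
    sum(toy_log_prob(c, word, vocab_size) for c in contexts)
    for word, contexts in per_word
)

# Log of the product in (2): a single sum over the pairs in D.
ll_eq2 = sum(toy_log_prob(c, word, vocab_size) for word, c in D)

print(len(D), ll_eq1, ll_eq2)
```

Working with sums of logs rather than products of probabilities is a practical choice to avoid numerical underflow; the maximizing θ is unchanged because the logarithm is monotonic.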

word2vec software: https://code.google.com/p/word2vec/
