word2vec Model Explained: The Parameter Learning Process


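"A detailed explanation of the parameter learning process of word2vec models, covering the original continuous bag-of-words (CBOW) and skip-gram models as well as optimization techniques such as hierarchical softmax and negative sampling."

In natural language processing (NLP), word2vec, proposed by Mikolov et al., is a widely used tool for learning word vectors. It maps each word to a real-valued vector that captures semantic information about the word. Word vectors perform well in many NLP tasks, such as part-of-speech tagging, sentiment analysis, and machine translation, and have drawn attention because they capture semantic relations between words.

word2vec has two main training architectures: the continuous bag-of-words (CBOW) model and the skip-gram model. CBOW predicts the current center word from the words in the context window around it. Skip-gram does the opposite: it predicts the surrounding context words from the center word. Both learn word vectors by updating the model parameters to minimize the prediction error.

The parameters are typically learned by gradient descent. CBOW uses the average of the context word vectors as the hidden-layer input, computes the gradient of the loss with respect to each word vector by backpropagation, and updates the vectors. Skip-gram computes a prediction error for every context word, sums these errors, and uses the result to update the input vector of the center word.

Hierarchical softmax and negative sampling make training more efficient. Hierarchical softmax arranges the vocabulary in a Huffman tree, so computing the probability of a word requires only the nodes on its path from the root, roughly O(log V) work instead of O(V); frequent words get the shortest paths. Negative sampling updates the output vectors of the actual output word and a small number of randomly sampled "noise" words, rather than the output vectors of the whole vocabulary, which reduces the computation per training instance.

Beyond the mathematical derivation, an intuitive reading of the gradient update equations is also important. In CBOW, each update moves the predicted probability of the center word given its context closer to the observed target; in skip-gram, each update does the same for the context words given the center word. These interpretations help readers who are not neural network experts understand how the models work.

Following the material requires some background in linear algebra, probability, and optimization: computing word vectors involves matrix operations, the optimization relies on gradient descent and its variants, and the models are defined in terms of probabilities.

word2vec has had a large impact on NLP. A thorough understanding of its parameter learning process, including the training mechanisms of CBOW and skip-gram and optimization techniques such as hierarchical softmax and negative sampling, is essential for using and improving these models.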

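This is equivalent to the tensor product of $\mathbf{x}$ and $\mathrm{EH}$, i.e.,

$$\frac{\partial E}{\partial \mathbf{W}} = \mathbf{x} \otimes \mathrm{EH} = \mathbf{x}\,\mathrm{EH}^T \tag{14}$$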

from which we obtain a $V \times N$ matrix. Since only one component of $\mathbf{x}$ is non-zero, only one row of $\frac{\partial E}{\partial \mathbf{W}}$ is non-zero, and the value of that row is $\mathrm{EH}^T$, an $N$-dim vector. We obtain the update equation of $\mathbf{W}$ as

$$\mathbf{v}_{w_I}^{(\text{new})} = \mathbf{v}_{w_I}^{(\text{old})} - \eta\,\mathrm{EH}^T \tag{15}$$

where $\mathbf{v}_{w_I}$ is a row of $\mathbf{W}$, the "input vector" of the only context word, and is the only row of $\mathbf{W}$ whose derivative is non-zero. All the other rows of $\mathbf{W}$ will remain unchanged after this iteration, because their derivatives are zero.
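The sparsity argument behind (14) and (15) can be checked numerically. The following is a minimal NumPy sketch, not part of the original text: the sizes, the learning rate, the index `context_idx`, and the random stand-in for EH are all assumed values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 5, 3                   # toy vocabulary size and hidden size (assumed)
eta = 0.1                     # learning rate (assumed)

W = rng.normal(size=(V, N))   # input->hidden weights; row i is the input vector of word i
EH = rng.normal(size=N)       # random stand-in for the back-propagated hidden-layer error

context_idx = 2               # hypothetical index of the single context word w_I
x = np.zeros(V)
x[context_idx] = 1.0          # one-hot input vector

# Equation (14): dE/dW is the outer product of x and EH, a V x N matrix.
grad_W = np.outer(x, EH)

# Only the row belonging to the context word is non-zero.
nonzero_rows = np.flatnonzero(np.any(grad_W != 0, axis=1))

# Equation (15): updating that single row gives the same result as
# subtracting eta times the full gradient matrix.
W_full = W - eta * grad_W
W_row = W.copy()
W_row[context_idx] -= eta * EH

print(nonzero_rows, np.allclose(W_full, W_row))
```

In practice this is why an implementation can touch one row of W per training instance instead of building the full V × N gradient.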

Intuitively, since vector $\mathrm{EH}$ is the sum of output vectors of all words in vocabulary weighted by their prediction error $e_j = y_j - t_j$, we can understand (15) as adding a portion of every output vector in vocabulary to the input vector of the context word. If, in the output layer, the probability of a word $w_j$ being the output word is overestimated ($y_j > t_j$), then the input vector of the context word $w_I$ will tend to move farther away from the output vector of $w_j$; conversely, if the probability of $w_j$ being the output word is underestimated ($y_j < t_j$), then the input vector of $w_I$ will tend to move closer to the output vector of $w_j$; if the probability of $w_j$ is fairly accurately predicted, then it will have little effect on the movement of the input vector of $w_I$. The movement of the input vector of $w_I$ is determined by the prediction error of all vectors in the vocabulary; the larger the prediction error, the more significant the effect a word will exert on the movement of the input vector of the context word.
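To make this reading of (15) concrete, here is a small NumPy sketch under stated assumptions: a softmax output layer with cross-entropy loss, a hidden-to-output matrix `W_out` whose columns are the output vectors, and toy sizes, indices, and learning rate that are not taken from the text. It checks that EH is the error-weighted sum of output vectors and that one update of the input vector raises the predicted probability of the underestimated target word.

```python
import numpy as np

def softmax(u):
    z = np.exp(u - u.max())   # subtract the max for numerical stability
    return z / z.sum()

rng = np.random.default_rng(1)
V, N, eta = 5, 3, 0.1                    # toy sizes and learning rate (assumed)
W = 0.5 * rng.normal(size=(V, N))        # rows are input vectors
W_out = 0.5 * rng.normal(size=(N, V))    # columns are output vectors
context_idx, target_idx = 0, 3           # hypothetical word indices

h = W[context_idx]                       # hidden layer = input vector of the context word
y = softmax(W_out.T @ h)                 # predicted distribution over the vocabulary
t = np.zeros(V)
t[target_idx] = 1.0                      # one-hot target
e = y - t                                # prediction errors e_j = y_j - t_j

# EH as a matrix-vector product and as an explicit error-weighted sum
# of output vectors; the two must agree.
EH = W_out @ e
EH_as_sum = sum(e[j] * W_out[:, j] for j in range(V))

# Equation (15): move the input vector against EH, then predict again.
v_new = h - eta * EH
y_after = softmax(W_out.T @ v_new)

print(np.allclose(EH, EH_as_sum), y[target_idx], y_after[target_idx])
```

The target word always has a negative error (its probability is underestimated), so the update adds a positive multiple of its output vector to the input vector, while overestimated words contribute with a negative sign.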

As we iteratively update the model parameters by going through context-target word pairs generated from a training corpus, the effects on the vectors will accumulate. We can imagine that the output vector of a word w is "dragged" back-and-forth by the input vectors of w's co-occurring neighbors, as if there are physical strings between the vector of w and the vectors of its neighbors. Similarly, an input vector can also be considered as being dragged by many output vectors. This interpretation can remind us of gravity, or force-directed graph layout. The equilibrium length of each imaginary string is related to the strength of co-occurrence between the associated pair of words, as well as the learning rate. After many iterations, the relative positions of the input and output vectors will eventually stabilize.
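The accumulation of updates over many context-target pairs can be sketched as a toy training loop. This is an illustrative sketch only: the pairs, sizes, learning rate, and epoch count are invented, and the update of the output vectors uses the standard softmax cross-entropy gradient (the outer product of the hidden vector and the error vector), which is not derived in this excerpt.

```python
import numpy as np

def softmax(u):
    z = np.exp(u - u.max())   # subtract the max for numerical stability
    return z / z.sum()

def mean_loss(W, W_out, pairs):
    # Average cross-entropy, -log p(target | context), over the toy pairs.
    return float(np.mean([-np.log(softmax(W_out.T @ W[c])[t]) for c, t in pairs]))

rng = np.random.default_rng(0)
V, N, eta, epochs = 4, 3, 0.1, 200       # toy settings (assumed)
W = 0.5 * rng.normal(size=(V, N))        # rows are input vectors
W_out = 0.5 * rng.normal(size=(N, V))    # columns are output vectors
pairs = [(0, 1), (1, 0), (2, 3), (3, 2)] # hypothetical (context, target) index pairs

loss_before = mean_loss(W, W_out, pairs)
for _ in range(epochs):
    for c, t in pairs:
        h = W[c].copy()                  # hidden layer for a one-word context
        e = softmax(W_out.T @ h)
        e[t] -= 1.0                      # e = y - t
        EH = W_out @ e                   # computed before W_out is modified
        W_out -= eta * np.outer(h, e)    # drag the output vectors
        W[c] -= eta * EH                 # drag the input vector, as in (15)
loss_after = mean_loss(W, W_out, pairs)

print(loss_before, loss_after)
```

Each pass pulls the input vector of a context word toward the output vector of its target and pushes it away from the others, so the average loss on these pairs falls as the updates accumulate.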

1.2 Multi-word context

Figure 2 shows the CBOW model with a multi-word context setting. When computing the hidden layer output, instead of directly copying the input vector of the input context word, the CBOW model takes the average of the vectors of the input context words, and
