For all the following models, the training complexity is proportional to
O = E × T × Q, (1)
where E is the number of training epochs, T is the number of words in the training set and Q is
defined further for each model architecture. A common choice is E = 3 − 50 and T up to one billion.
All models are trained using stochastic gradient descent and backpropagation [26].
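As a rough back-of-the-envelope illustration of Equation (1), the sketch below plugs in placeholder values for E, T and Q; the numbers are purely illustrative and not taken from the experiments.

# Illustrative estimate of the total training cost O = E * T * Q.
# All values below are placeholders, chosen only to show the order of magnitude.
E = 3              # number of training epochs
T = 1_000_000_000  # number of words in the training set
Q = 500 * 500      # per-example cost; depends on the model architecture (defined below)

O = E * T * Q
print(f"approximate number of operations: {O:.2e}")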
2.1 Feedforward Neural Net Language Model (NNLM)
The probabilistic feedforward neural network language model was proposed in [1]. It consists
of input, projection, hidden and output layers. At the input layer, the N previous words are encoded
using 1-of-V coding, where V is the size of the vocabulary. The input layer is then projected to a
projection layer P that has dimensionality N × D, using a shared projection matrix. As only N
inputs are active at any given time, composing the projection layer is a relatively cheap operation.
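Because each of the N inputs is a 1-of-V vector, multiplying it by the shared projection matrix reduces to selecting N rows of that matrix. A minimal NumPy sketch of this lookup follows; the dimensions, word indices and random parameters are made up for illustration.

import numpy as np

V, D, N = 10_000, 100, 4            # illustrative vocabulary size, embedding size, context size
projection = np.random.randn(V, D)  # shared projection matrix

context_ids = [17, 256, 3, 999]     # indices of the N previous words (1-of-V coding)
# Since only N inputs are active, the projection is just N row lookups,
# concatenated into a vector of dimensionality N x D.
P = projection[context_ids].reshape(-1)
print(P.shape)                      # (N * D,) = (400,)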
The NNLM architecture becomes computationally expensive between the projection and the hidden
layer, as the values in the projection layer are dense. For a common choice of N = 10, the size of the
projection layer (P) might be 500 to 2000, while the hidden layer size H is typically 500 to 1000
units. Moreover, the hidden layer is used to compute the probability distribution over all the words in
the vocabulary, resulting in an output layer with dimensionality V. Thus, the computational complexity
per training example is
Q = N × D + N × D × H + H × V, (2)
where the dominating term is H × V. However, several practical solutions were proposed for
avoiding it: either using hierarchical versions of the softmax [25, 23, 18], or avoiding normalized
models completely by using models that are not normalized during training [4, 9]. With binary tree
representations of the vocabulary, the number of output units that need to be evaluated can go down
to around log2(V). Thus, most of the complexity is caused by the term N × D × H.
In our models, we use hierarchical softmax where the vocabulary is represented as a Huffman binary
tree. This follows previous observations that the frequency of words works well for obtaining classes
in neural net language models [16]. Huffman trees assign short binary codes to frequent words, and
this further reduces the number of output units that need to be evaluated: while a balanced binary
tree would require log2(V) outputs to be evaluated, the Huffman tree based hierarchical softmax
requires only about log2(Unigram perplexity(V)) outputs. For example, when the vocabulary size is
one million words, this results in about a two-fold speedup in evaluation. While this is not a crucial
speedup for neural network LMs, as the computational bottleneck is in the N × D × H term, we will
later propose architectures that do not have hidden layers and thus depend heavily on the efficiency
of the softmax normalization.
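The effect of the Huffman tree can be illustrated with a small self-contained sketch that builds Huffman code lengths over an assumed Zipf-like unigram distribution and compares their frequency-weighted average to the log2(V) cost of a balanced tree. The vocabulary size and the frequency distribution are assumptions for illustration, not the setup used in the experiments.

import heapq, math

V = 100_000                                   # illustrative vocabulary size
freqs = [1.0 / r for r in range(1, V + 1)]    # assumed Zipf-like unigram frequencies
total = sum(freqs)
probs = [f / total for f in freqs]

# Build Huffman code lengths: repeatedly merge the two least probable subtrees.
depths = [0] * V
heap = [(p, [i]) for i, p in enumerate(probs)]
heapq.heapify(heap)
while len(heap) > 1:
    p1, w1 = heapq.heappop(heap)
    p2, w2 = heapq.heappop(heap)
    for i in w1 + w2:                         # every word in the merged subtree gets one bit deeper
        depths[i] += 1
    heapq.heappush(heap, (p1 + p2, w1 + w2))

avg_code_len = sum(p * d for p, d in zip(probs, depths))
print("balanced tree outputs per word:", math.log2(V))   # ~16.6
print("Huffman tree outputs per word :", avg_code_len)   # noticeably smaller: frequent words get short codes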
2.2 Recurrent Neural Net Language Model (RNNLM)
A recurrent neural network based language model has been proposed to overcome certain limitations
of the feedforward NNLM, such as the need to specify the context length (the order of the model N),
and because theoretically RNNs can efficiently represent more complex patterns than shallow
neural networks [15, 2]. The RNN model does not have a projection layer; only input, hidden and
output layers. What is special about this type of model is the recurrent matrix that connects the hidden
layer to itself, using time-delayed connections. This allows the recurrent model to form a kind of
short-term memory, as information from the past can be represented by the hidden layer state, which
gets updated based on the current input and the state of the hidden layer in the previous time step.
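A minimal NumPy sketch of this recurrence is given below. The dimensions are illustrative, and the tanh nonlinearity is only one common choice; the exact parameterization used in [15, 2] may differ.

import numpy as np

H, V = 50, 10_000                     # illustrative hidden size and vocabulary size
W_in  = np.random.randn(H, V) * 0.01  # input weights; with 1-of-V input this is a column lookup
W_rec = np.random.randn(H, H) * 0.01  # recurrent matrix connecting the hidden layer to itself

h = np.zeros(H)                       # hidden state, acting as short-term memory
for word_id in [12, 7, 4051]:         # a toy input word sequence
    x = np.zeros(V)
    x[word_id] = 1.0                  # 1-of-V coding of the current word
    # The new state depends on the current input and the previous hidden state.
    h = np.tanh(W_in @ x + W_rec @ h)
print(h.shape)                        # (H,)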
The complexity per training example of the RNN model is
Q = H × H + H × V, (3)
where the word representations D have the same dimensionality as the hidden layer H. Again, the
term H × V can be efficiently reduced to H × log2(V) by using hierarchical softmax. Most of the
complexity then comes from H × H.
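Plugging illustrative values into Equation (3) shows the same pattern as for the NNLM; the numbers below are examples only.

import math

H, V = 500, 1_000_000                          # illustrative hidden size and vocabulary size
term_recurrent = H * H                         # 250,000
term_output    = H * V                         # 500,000,000 without any speedup
term_output_hs = H * math.ceil(math.log2(V))   # ~10,000 with hierarchical softmax

print(term_recurrent, term_output, term_output_hs)
# With hierarchical softmax, the H*H recurrent term dominates the per-example cost.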