than a few hundred million words, with a modest dimensionality of the word vectors between
50 and 100.
We use recently proposed techniques for measuring the quality of the resulting vector representa-
tions, with the expectation that not only will similar words tend to be close to each other, but that
words can have multiple degrees of similarity [20]. This has been observed earlier in the context
of inflectional languages: for example, nouns can have multiple word endings, and if we search for
similar words in a subspace of the original vector space, it is possible to find words with similar
endings [13, 14].
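As a concrete illustration of how such similarities are typically probed, the sketch below finds the nearest neighbours of a word by cosine similarity between word vectors. The vocabulary, embedding table, and the nearest_neighbors helper are illustrative assumptions introduced here, not part of the models described in this paper.

import numpy as np

def nearest_neighbors(word, vocab, embeddings, k=5):
    # Normalize all vectors so that a dot product equals cosine similarity.
    vecs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    idx = vocab.index(word)
    sims = vecs @ vecs[idx]                      # cosine similarity to every word
    order = np.argsort(-sims)                    # most similar first
    return [vocab[i] for i in order if i != idx][:k]

# Toy vocabulary and 3-dimensional embedding table (purely illustrative).
vocab = ["king", "queen", "man", "woman"]
embeddings = np.array([[0.9, 0.1, 0.4],
                       [0.8, 0.2, 0.5],
                       [0.3, 0.9, 0.1],
                       [0.2, 0.8, 0.2]])
print(nearest_neighbors("king", vocab, embeddings, k=2))   # -> ['queen', 'man']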
Somewhat surprisingly, it was found that similarity of word representations goes beyond simple
syntactic regularities. Using a word offset technique where simple algebraic operations are per-
formed on the word vectors, it was shown for example that vector("King") - vector("Man") + vec-
tor("Woman") results in a vector that is closest to the vector representation of the word Queen [20].
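The word-offset operation itself can be sketched in a few lines: form the combined vector and return the closest word in the vocabulary, excluding the input words. The toy vocabulary and embedding table below are again illustrative placeholders, not vectors trained by any of the models discussed here.

import numpy as np

def analogy(a, b, c, vocab, embeddings):
    # Answers "a is to b as c is to ?" via vector(b) - vector(a) + vector(c).
    vecs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    target = vecs[vocab.index(b)] - vecs[vocab.index(a)] + vecs[vocab.index(c)]
    sims = vecs @ (target / np.linalg.norm(target))
    exclude = {vocab.index(w) for w in (a, b, c)}            # never return an input word
    best = max((i for i in range(len(vocab)) if i not in exclude), key=lambda i: sims[i])
    return vocab[best]

vocab = ["king", "queen", "man", "woman"]                    # toy data, illustrative only
embeddings = np.array([[0.9, 0.1, 0.4],
                       [0.8, 0.2, 0.5],
                       [0.3, 0.9, 0.1],
                       [0.2, 0.8, 0.2]])
print(analogy("man", "king", "woman", vocab, embeddings))    # -> "queen"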
In this paper, we try to maximize the accuracy of these vector operations by developing new model
architectures that preserve the linear regularities among words. We design a new comprehensive test
set for measuring both syntactic and semantic regularities¹, and show that many such regularities
can be learned with high accuracy. Moreover, we discuss how training time and accuracy depend
on the dimensionality of the word vectors and on the amount of the training data.
1.2 Previous Work
Representation of words as continuous vectors has a long history [10, 26, 8]. A very popular model
architecture for estimating a neural network language model (NNLM) was proposed in [1], where a
feedforward neural network with a linear projection layer and a non-linear hidden layer was used to
jointly learn the word vector representation and a statistical language model. This work has been
followed by many others.
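A minimal sketch may help make this kind of feedforward NNLM concrete: the N-1 previous words are looked up in a shared projection matrix, concatenated, passed through a non-linear hidden layer, and a softmax over the vocabulary predicts the next word. The layer sizes, initialization, and tanh nonlinearity below are illustrative assumptions rather than the exact setup of [1].

import numpy as np

V, D, H, N = 10000, 100, 500, 4     # vocabulary size, vector dim, hidden units, n-gram order (assumed)
rng = np.random.default_rng(0)

C   = rng.normal(scale=0.1, size=(V, D))             # linear projection layer = word-vector table
W_h = rng.normal(scale=0.1, size=((N - 1) * D, H))   # projection-to-hidden weights
b_h = np.zeros(H)
W_o = rng.normal(scale=0.1, size=(H, V))             # hidden-to-output weights
b_o = np.zeros(V)

def forward(context_ids):
    # Distribution over the next word given the N-1 previous word ids.
    x = C[context_ids].reshape(-1)                   # look up and concatenate context vectors
    h = np.tanh(x @ W_h + b_h)                       # non-linear hidden layer
    logits = h @ W_o + b_o
    p = np.exp(logits - logits.max())
    return p / p.sum()                               # softmax over the full vocabulary

probs = forward([12, 7, 945])                        # three context word ids -> V probabilities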
Another interesting NNLM architecture was presented in [13, 14], where the word vectors are
first learned using a neural network with a single hidden layer. The word vectors are then used to train
the NNLM. Thus, the word vectors are learned even without constructing the full NNLM. In this
work, we directly extend this architecture, and focus just on the first step where the word vectors are
learned using a simple model.
It was later shown that the word vectors can be used to significantly improve and simplify many
NLP applications [4, 5, 29]. Estimation of the word vectors itself was performed using different
model architectures and trained on various corpora [4, 29, 23, 19, 9], and some of the resulting word
vectors were made available for future research and comparison². However, as far as we know, these
architectures were significantly more computationally expensive to train than the one proposed
in [13], with the exception of a certain version of the log-bilinear model where diagonal weight
matrices are used [23].
2 Model Architectures
Many different types of models were proposed for estimating continuous representations of words,
including the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).
In this paper, we focus on distributed representations of words learned by neural networks, as it was
previously shown that they perform significantly better than LSA for preserving linear regularities
among words [20, 31]; moreover, LDA becomes computationally very expensive on large data sets.
Similarly to [18], to compare different model architectures we first define the computational complex-
ity of a model as the number of parameters that need to be accessed to fully train the model. Next,
we will try to maximize the accuracy, while minimizing the computational complexity.
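As a back-of-the-envelope illustration of this measure (the sizes are the same assumed values used in the NNLM sketch above, not figures from this paper), one can count how many parameters a single training example of such a feedforward NNLM touches; the hidden-to-output weights clearly dominate.

V, D, H, N = 10000, 100, 500, 4       # assumed sizes: vocabulary, vector dim, hidden units, n-gram order

projection = (N - 1) * D              # context word vectors looked up
hidden     = (N - 1) * D * H          # projection-to-hidden weights
output     = H * V                    # hidden-to-output weights

per_example = projection + hidden + output
print(per_example)                    # 5150300 parameters accessed per training example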
¹ The test set is available at www.fit.vutbr.cz/~imikolov/rnnlm/word-test.v1.txt
² http://ronan.collobert.com/senna/
  http://metaoptimize.com/projects/wordreprs/
  http://www.fit.vutbr.cz/~imikolov/rnnlm/
  http://ai.stanford.edu/~ehhuang/