doc2vec Explained: A Text Representation Method Beyond Bag-of-Words


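"doc2vec is a distributed text representation method proposed at Google. It aims to overcome the limitations of the traditional bag-of-words model by learning fixed-length feature vectors that capture the semantics and word order of a text."

In machine learning, input data usually has to be converted into fixed-length feature vectors. For text, the most common choice is the bag-of-words model. Bag-of-words has two notable weaknesses: it discards word order, and it ignores the semantics of words. For example, "powerful", "strong" and "Paris" are all equally distant from one another under bag-of-words, which does not reflect how they differ in meaning.

doc2vec, also known as Paragraph Vector, was proposed by Quoc Le and Tomas Mikolov at Google. It is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text such as sentences, paragraphs, and documents. Each document is represented by a dense vector that is trained to predict the words in that document. This construction lets doc2vec retain some word-order information and some word semantics, which addresses the two weaknesses of bag-of-words. The paper reports that Paragraph Vector outperforms bag-of-words models and other text representation techniques, including on text classification and sentiment analysis tasks.

doc2vec comes in two variants: Distributed Memory (PV-DM) and Distributed Bag of Words (PV-DBOW). PV-DM combines the paragraph vector with the vectors of the words in a context window and predicts the next word. PV-DBOW leaves the context words out of the input and trains the paragraph vector alone to predict words sampled from the paragraph.

Because the paragraph vector is shared across every context window drawn from the same document, it carries document-level information that a single local window lacks. doc2vec has been applied to natural language processing tasks such as question answering, sentiment analysis, and information retrieval. The learned vectors can also be used to compute the similarity between two texts, which is useful for recommender systems and text clustering.

In short, doc2vec improves the quality of text representations by learning fixed-length vectors that encode semantic and contextual information. It may still fall short in some complex settings, but it has been an influential contribution to text representation in natural language processing.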
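The summary above notes that document vectors can be compared to measure how similar two texts are. The sketch below shows one common way to do that, cosine similarity. The choice of cosine similarity is an assumption (the summary does not name a measure), and the three 4-dimensional vectors are made-up values for illustration, not the output of a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical document vectors; a real model would produce these.
doc_a = [0.9, 0.1, 0.3, 0.0]
doc_b = [0.8, 0.2, 0.4, 0.1]
doc_c = [-0.5, 0.9, -0.2, 0.7]

print(cosine_similarity(doc_a, doc_b))  # close to 1: similar direction
print(cosine_similarity(doc_a, doc_c))  # negative: dissimilar direction
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why it is a frequent choice for comparing embeddings.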

Distributed Representations of Sentences and Documents

Quoc Le QVL@GOOGLE.COM

Tomas Mikolov TMIKOLOV@GOOGLE.COM

Google Inc, 1600 Amphitheatre Parkway, Mountain View, CA 94043

Abstract

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, “powerful,” “strong” and “Paris” are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.

1. Introduction

Text classification and clustering play an important role in many applications, e.g., document retrieval, web search, spam filtering. At the heart of these applications are machine learning algorithms such as logistic regression or K-means. These algorithms typically require the text input to be represented as a fixed-length vector. Perhaps the most common fixed-length vector representation for texts is the bag-of-words or bag-of-n-grams (Harris, 1954) due to its simplicity, efficiency and often surprising accuracy.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

However, the bag-of-words (BOW) has many disadvantages. The word order is lost, and thus different sentences can have exactly the same representation, as long as the same words are used. Even though bag-of-n-grams considers the word order in short context, it suffers from data sparsity and high dimensionality. Bag-of-words and bag-of-n-grams have very little sense about the semantics of the words or more formally the distances between the words. This means that words “powerful,” “strong” and “Paris” are equally distant despite the fact that semantically, “powerful” should be closer to “strong” than “Paris.”
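Both weaknesses described above can be seen in a few lines of code. This is a minimal sketch using a hand-picked six-word vocabulary and simple whitespace tokenization; the vocabulary, the example sentences, and the helper names `bag_of_words` and `euclidean` are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
import math

def bag_of_words(text, vocab):
    """Count vector over a fixed vocabulary; word order is discarded."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

vocab = ["dog", "bites", "man", "powerful", "strong", "paris"]

# Weakness 1: sentences with different meanings share one representation.
a = bag_of_words("dog bites man", vocab)
b = bag_of_words("man bites dog", vocab)
print(a == b)

# Weakness 2: single words become one-hot vectors, so every pair of
# distinct words is exactly the same distance apart.
powerful = bag_of_words("powerful", vocab)
strong = bag_of_words("strong", vocab)
paris = bag_of_words("paris", vocab)
print(euclidean(powerful, strong), euclidean(powerful, paris))
```

Any two distinct one-hot vectors differ in exactly two positions, so their distance is always the square root of 2, regardless of what the words mean.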

In this paper, we propose Paragraph Vector, an unsupervised framework that learns continuous distributed vector representations for pieces of texts. The texts can be of variable-length, ranging from sentences to documents. The name Paragraph Vector is to emphasize the fact that the method can be applied to variable-length pieces of texts, anything from a phrase or sentence to a large document.

In our model, the vector representation is trained to be useful for predicting words in a paragraph. More precisely, we concatenate the paragraph vector with several word vectors from a paragraph and predict the following word in the given context. Both word vectors and paragraph vectors are trained by the stochastic gradient descent and backpropagation (Rumelhart et al., 1986). While paragraph vectors are unique among paragraphs, the word vectors are shared. At prediction time, the paragraph vectors are inferred by fixing the word vectors and training the new paragraph vector until convergence.
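The training and inference procedure described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the two-sentence corpus, the values of `DIM`, `WINDOW` and `LR`, the fixed epoch counts (in place of a convergence check), and all function names are hypothetical, and a full softmax output layer is used for brevity.

```python
import math
import random

random.seed(0)
docs = ["the cat sat on the mat".split(), "the dog sat on the rug".split()]
vocab = sorted({w for d in docs for w in d})
word_id = {w: i for i, w in enumerate(vocab)}
DIM, WINDOW, LR = 8, 2, 0.1        # illustrative values, not from the paper
IN = DIM * (WINDOW + 1)            # paragraph vector + WINDOW word vectors

def rand_vec(n):
    return [random.uniform(-0.1, 0.1) for _ in range(n)]

W = [rand_vec(DIM) for _ in vocab]  # word vectors, shared across paragraphs
D = [rand_vec(DIM) for _ in docs]   # one paragraph vector per paragraph
U = [rand_vec(IN) for _ in vocab]   # softmax weights
b = [0.0] * len(vocab)              # softmax biases

def step(d_vec, context, target, update_shared=True):
    """One SGD step on -log p(target | paragraph, context); returns the loss."""
    h = d_vec + [x for c in context for x in W[c]]   # concatenation
    logits = [sum(u * x for u, x in zip(U[i], h)) + b[i] for i in range(len(vocab))]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    err = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # Gradient w.r.t. the concatenated input, computed before any update.
    grad_h = [sum(err[i] * U[i][j] for i in range(len(vocab))) for j in range(IN)]
    if update_shared:
        for i in range(len(vocab)):
            for j in range(IN):
                U[i][j] -= LR * err[i] * h[j]
            b[i] -= LR * err[i]
        for k, c in enumerate(context):
            for j in range(DIM):
                W[c][j] -= LR * grad_h[DIM * (k + 1) + j]
    for j in range(DIM):             # the paragraph vector is always updated
        d_vec[j] -= LR * grad_h[j]
    return -math.log(probs[target])

def epoch(words, d_vec, update_shared=True):
    """Slide over one paragraph, predicting each word from the WINDOW before it."""
    ids = [word_id[w] for w in words]
    return sum(step(d_vec, ids[t - WINDOW:t], ids[t], update_shared)
               for t in range(WINDOW, len(ids)))

# Training: paragraph vectors, word vectors and softmax weights all move.
losses = [sum(epoch(doc, D[i]) for i, doc in enumerate(docs)) for _ in range(200)]

# Inference for an unseen paragraph: shared parameters stay fixed and only
# the new paragraph vector is trained.
new_doc = "the cat sat on the rug".split()
W_before = [row[:] for row in W]
d_new = rand_vec(DIM)
infer_losses = [epoch(new_doc, d_new, update_shared=False) for _ in range(50)]
print(losses[0], losses[-1], infer_losses[0], infer_losses[-1])
```

In this sketch the softmax weights are held fixed during inference along with the word vectors, so the only free parameters for a new paragraph are the entries of `d_new`.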

Our technique is inspired by the recent work in learning vector representations of words using neural networks (Bengio et al., 2006; Collobert & Weston, 2008; Mnih & Hinton, 2008; Turian et al., 2010; Mikolov et al., 2013a;c). In their formulation, each word is represented by a vector which is concatenated or averaged with other word vectors in a context, and the resulting vector is used to predict other words in the context. For example, the neural network language model proposed in (Bengio et al., 2006) uses the concatenation of several previous word vectors to form the input of a neural network, and tries to predict the next word. The outcome is that after the model is trained, the word vectors are mapped into a vector space such that
