
Distributed Representations of Sentences and Documents
Quoc Le QVL@GOOGLE.COM
Tomas Mikolov TMIKOLOV@GOOGLE.COM
Google Inc, 1600 Amphitheatre Parkway, Mountain View, CA 94043
Abstract
Many machine learning algorithms require the
input to be represented as a fixed-length feature
vector. When it comes to texts, one of the most
common fixed-length features is bag-of-words.
Despite their popularity, bag-of-words features
have two major weaknesses: they lose the order-
ing of the words and they also ignore the semantics
of the words. For example, “powerful,” “strong”
and “Paris” are equally distant. In this paper, we
propose Paragraph Vector, an unsupervised algo-
rithm that learns fixed-length feature representa-
tions from variable-length pieces of texts, such as
sentences, paragraphs, and documents. Our algo-
rithm represents each document by a dense vec-
tor which is trained to predict words in the doc-
ument. Its construction gives our algorithm the
potential to overcome the weaknesses of bag-of-
words models. Empirical results show that Para-
graph Vectors outperform bag-of-words models
as well as other techniques for text representa-
tions. Finally, we achieve new state-of-the-art re-
sults on several text classification and sentiment
analysis tasks.
1. Introduction
Text classification and clustering play an important role
in many applications, e.g., document retrieval, web search, and spam filtering. At the heart of these applications are machine learning algorithms such as logistic regression or K-means. These algorithms typically require the text input to
be represented as a fixed-length vector. Perhaps the most
common fixed-length vector representation for texts is the
bag-of-words or bag-of-n-grams (Harris, 1954) due to its
simplicity, efficiency and often surprising accuracy.
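For concreteness, the sketch below shows how a bag-of-words (and bag-of-bigrams) feature vector can be built; the toy sentence, tokenization and vocabulary are illustrative assumptions and not part of the experiments in this paper.

```python
# A minimal sketch of bag-of-words / bag-of-n-grams featurization.
# The sentence, tokenization and vocabulary below are toy examples.
from collections import Counter

def bag_of_ngrams(tokens, n=1):
    """Count the n-grams occurring in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def to_vector(counts, vocabulary):
    """Project n-gram counts onto a fixed-length vector over a fixed vocabulary."""
    return [counts.get(term, 0) for term in vocabulary]

tokens = "the cat sat on the mat".split()
unigrams = bag_of_ngrams(tokens, n=1)
bigrams = bag_of_ngrams(tokens, n=2)

print(to_vector(unigrams, sorted(unigrams)))  # [1, 1, 1, 1, 2]: counts over the unigram vocabulary
print(to_vector(bigrams, sorted(bigrams)))    # [1, 1, 1, 1, 1]: counts over the bigram vocabulary
```

Any document, regardless of its length, is mapped to a vector whose dimensionality equals the vocabulary size, which is what makes the representation fixed-length.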
However, the bag-of-words (BOW) has many disadvan-
tages. The word order is lost, and thus different sentences
can have exactly the same representation, as long as the
same words are used. Even though bag-of-n-grams considers the word order in short contexts, it suffers from data sparsity and high dimensionality. Bag-of-words and bag-of-n-grams capture very little of the semantics of the words or, more formally, the distances between the words. This means that the words “powerful,” “strong” and “Paris” are equally distant despite the fact that, semantically, “powerful” should be closer to “strong” than to “Paris.”
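The toy sketch below makes both weaknesses concrete; the example sentences and the three-word vocabulary are illustrative assumptions only.

```python
# A small sketch of the two weaknesses of bag-of-words features:
# (1) word order is lost, (2) all distinct words are equally distant.
from collections import Counter
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# (1) Two different sentences with the same words get identical bag-of-words counts.
a = Counter("the plot was good not bad".split())
b = Counter("the plot was bad not good".split())
print(a == b)  # True: a bag-of-words model cannot tell the two sentences apart

# (2) With one dimension per word, "powerful", "strong" and "Paris" are equally distant.
vocab = ["powerful", "strong", "Paris"]
one_hot = {w: [1 if j == i else 0 for j in range(len(vocab))] for i, w in enumerate(vocab)}
print(cosine(one_hot["powerful"], one_hot["strong"]))  # 0.0
print(cosine(one_hot["powerful"], one_hot["Paris"]))   # 0.0, i.e. the same for both pairs
```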
In this paper, we propose Paragraph Vector, an unsuper-
vised framework that learns continuous distributed vector
representations for pieces of text. The texts can be of variable length, ranging from sentences to documents. The name Paragraph Vector is meant to emphasize that the method can be applied to variable-length pieces of text, anything from a phrase or sentence to a large document.
In our model, the vector representation is trained to be use-
ful for predicting words in a paragraph. More precisely, we
concatenate the paragraph vector with several word vec-
tors from a paragraph and predict the following word in the
given context. Both word vectors and paragraph vectors are trained by stochastic gradient descent and backpropagation (Rumelhart et al., 1986). While paragraph vectors are unique to each paragraph, the word vectors are shared across all paragraphs. At prediction time, the paragraph vector for a new paragraph is inferred by fixing the word vectors and training the new paragraph vector until convergence.
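The following is a minimal sketch of this training and inference procedure, under simplifying assumptions: a tiny toy corpus, a full-softmax output layer, and arbitrarily chosen dimensions and learning rate (the models in this paper use more efficient output layers such as hierarchical softmax).

```python
# A toy sketch of the paragraph-vector idea: concatenate a per-document vector
# with preceding word vectors and predict the next word; train by SGD/backprop.
import numpy as np

rng = np.random.default_rng(0)
docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["dogs", "sat", "on", "the", "rug"]]
vocab = sorted({w for d in docs for w in d})
w2i = {w: i for i, w in enumerate(vocab)}
V, dim, window, lr = len(vocab), 16, 2, 0.05

W = rng.normal(0, 0.1, (V, dim))                  # word vectors, shared across paragraphs
D = rng.normal(0, 0.1, (len(docs), dim))          # one paragraph vector per document
U = rng.normal(0, 0.1, ((window + 1) * dim, V))   # softmax weights over the concatenated input

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def train_pass(doc_vec, words, update_shared=True):
    """One pass over a document; set update_shared=False to infer a new paragraph vector."""
    global U
    for t in range(window, len(words)):
        ctx = [w2i[w] for w in words[t - window:t]]
        target = w2i[words[t]]
        x = np.concatenate([doc_vec] + [W[c] for c in ctx])  # paragraph vector ++ context words
        p = softmax(x @ U)
        grad_out = p.copy()
        grad_out[target] -= 1.0                               # gradient of cross-entropy wrt logits
        grad_x = U @ grad_out
        if update_shared:
            U -= lr * np.outer(x, grad_out)
            for k, c in enumerate(ctx):
                W[c] -= lr * grad_x[(k + 1) * dim:(k + 2) * dim]
        doc_vec -= lr * grad_x[:dim]                          # the paragraph vector is always updated

# Training: paragraph vectors, word vectors and softmax weights are learned jointly.
for _ in range(50):
    for i, doc in enumerate(docs):
        train_pass(D[i], doc)

# Inference on an unseen paragraph: word vectors and softmax weights are frozen,
# and only the new paragraph vector is trained until convergence.
new_doc = ["the", "cat", "sat", "on", "the", "rug"]
v = rng.normal(0, 0.1, dim)
for _ in range(50):
    train_pass(v, new_doc, update_shared=False)
```

Only the new paragraph vector is updated during inference; the shared word vectors and the softmax weights remain fixed, exactly as described above.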
Our technique is inspired by the recent work in learn-
ing vector representations of words using neural net-
works (Bengio et al., 2006; Collobert & Weston, 2008;
Mnih & Hinton, 2008; Turian et al., 2010; Mikolov et al.,
2013a;c). In their formulation, each word is represented by
a vector which is concatenated or averaged with other word
vectors in a context, and the resulting vector is used to pre-
dict other words in the context. For example, the neural
network language model proposed in (Bengio et al., 2006)
uses the concatenation of several previous word vectors to
form the input of a neural network, and tries to predict the
next word. The outcome is that after the model is trained,
the word vectors are mapped into a vector space such that