Method | Architecture | Encoder | Decoder | Objective | Dataset
ELMo | LSTM | ✗ | ✓ | LM | 1B Word Benchmark
GPT | Transformer | ✗ | ✓ | LM | BookCorpus
GPT2 | Transformer | ✗ | ✓ | LM | Web pages starting from Reddit
BERT | Transformer | ✓ | ✗ | MLM & NSP | BookCorpus & Wiki
RoBERTa | Transformer | ✓ | ✗ | MLM | BookCorpus, Wiki, CC-News, OpenWebText, Stories
ALBERT | Transformer | ✓ | ✗ | MLM & SOP | Same as RoBERTa and XLNet
UniLM | Transformer | ✓ | ✗ | LM, MLM, seq2seq LM | Same as BERT
ELECTRA | Transformer | ✓ | ✗ | Discriminator (o/r) | Same as XLNet
XLNet | Transformer | ✗ | ✓ | PLM | BookCorpus, Wiki, Giga5, ClueWeb, Common Crawl
XLM | Transformer | ✓ | ✓ | CLM, MLM, TLM | Wiki, parallel corpora (e.g. MultiUN)
MASS | Transformer | ✓ | ✓ | Span Mask | WMT News Crawl
T5 | Transformer | ✓ | ✓ | Text Infilling | Colossal Clean Crawled Corpus
BART | Transformer | ✓ | ✓ | Text Infilling & Sent Shuffling | Same as RoBERTa
Table 1: A comparison of popular pre-trained models.
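Most of the models in Table 1 are released as pre-trained checkpoints. As a minimal illustration (not part of the surveyed papers), the sketch below loads one model of each architectural type through the HuggingFace transformers library and encodes the example sentence used in Table 2; the model identifiers are commonly used hub names given for illustration only.

```python
# Illustrative sketch only: load one encoder-only, one decoder-only and one
# encoder-decoder model from Table 1 via the HuggingFace transformers library
# and run the Table 2 example sentence through each of them.
from transformers import AutoModel, AutoTokenizer

checkpoints = [
    "bert-base-uncased",    # encoder-only (MLM & NSP)
    "gpt2",                 # decoder-only (LM)
    "facebook/bart-base",   # encoder-decoder (text infilling & sentence shuffling)
]

for name in checkpoints:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    inputs = tokenizer("I am happy to join with you today", return_tensors="pt")
    outputs = model(**inputs)
    print(name, tuple(outputs.last_hidden_state.shape))
```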
Objective | Inputs | Targets
LM | [START] I am happy to join with you | today
MLM | I am [MASK] to join with you [MASK] | happy today
NSP | Sent1 [SEP] Next Sent or Sent1 [SEP] Random Sent | Next Sent / Random Sent
SOP | Sent1 [SEP] Sent2 or Sent2 [SEP] Sent1 | in order / reversed
Discriminator (o/r) | I am thrilled to study with you today | o o r o r o o o
PLM | happy join with today | am I to you
seq2seq LM | I am happy to | join with you today
Span Mask | I am [MASK] [MASK] [MASK] with you today | happy to join
Text Infilling | I am [MASK] with you today | happy to join
Sent Shuffling | today you am I join with happy to | I am happy to join with you today
TLM | How [MASK] you [SEP] [MASK] vas-tu | are Comment
Table 2: Pre-training objectives and their input-output formats.
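To make the corruption schemes concrete, the following toy sketch (written for illustration here, not taken from any of the cited papers) builds input-target pairs for three of the objectives in Table 2 from the example sentence; the targets are always the tokens the model must recover.

```python
# Toy illustration of three objectives from Table 2; not an implementation
# from any cited paper.
import random

tokens = "I am happy to join with you today".split()

def mlm(tokens, p=0.25, seed=0):
    """Masked LM: each token is masked with probability p; masked tokens are targets."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < p:
            inputs.append("[MASK]")
            targets.append(tok)
        else:
            inputs.append(tok)
    return inputs, targets

def span_mask(tokens, start=2, length=3):
    """Span masking (MASS-style): a contiguous span is masked token by token."""
    return tokens[:start] + ["[MASK]"] * length + tokens[start + length:], tokens[start:start + length]

def text_infilling(tokens, start=2, length=3):
    """Text infilling (T5/BART-style): the whole span is replaced by a single [MASK]."""
    return tokens[:start] + ["[MASK]"] + tokens[start + length:], tokens[start:start + length]

print(mlm(tokens))
print(span_mask(tokens))       # I am [MASK] [MASK] [MASK] with you today -> happy to join
print(text_infilling(tokens))  # I am [MASK] with you today -> happy to join
```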
$[\mathbf{x}_k; \mathbf{ELMo}_k^{task}]$, before feeding them to higher
layers.
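As a minimal sketch of this step (dimensions and values are made up for illustration), the ELMo vector for token k is a learned, softmax-normalised and scaled mixture of the biLM layer outputs, which is then concatenated with the context-independent token representation x_k:

```python
# Illustrative sketch with made-up dimensions: build the task-specific ELMo
# vector as a weighted mixture of biLM layers and concatenate it with x_k.
import torch

L, dim = 2, 1024                     # number of biLM layers and hidden size (illustrative)
h = torch.randn(L + 1, dim)          # h_{k,j}: token layer (j=0) plus L biLM layers for token k
x_k = torch.randn(dim)               # context-independent token representation

s = torch.softmax(torch.zeros(L + 1), dim=0)  # s_j^task: softmax-normalised learned weights
gamma = torch.tensor(1.0)                     # gamma^task: learned scaling factor

elmo_k = gamma * (s.unsqueeze(1) * h).sum(dim=0)  # ELMo_k^task
enhanced = torch.cat([x_k, elmo_k])               # [x_k; ELMo_k^task], passed to higher layers
```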
The effectiveness of ELMo is evaluated on six
NLP problems, including question answering, tex-
tual entailment and sentiment analysis.
GPT, GPT2, and Grover. GPT (Radford et al.,
2018) adopts a two-stage learning paradigm: (a)
unsupervised pre-training using a language mod-
elling objective and (b) supervised fine-tuning.
The goal is to learn universal representations trans-
ferable to a wide range of downstream tasks.
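Concretely, restating the formulation of Radford et al. (2018), stage (a) maximises the language-modelling likelihood
$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$
over an unlabelled token corpus $\mathcal{U}$ with context window $k$ and parameters $\Theta$, while stage (b) maximises
$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)$
over a labelled dataset $\mathcal{C}$, optionally keeping the language-modelling term as an auxiliary loss, $L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})$.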
To this end, GPT uses the BookCorpus dataset
(Zhu et al., 2015), which contains more than 7,000
books from various genres, for training the lan-
guage model. The Transformer architecture (Vaswani et al., 2017) is used to implement the
language model, which has been shown to bet-
ter capture global dependencies from the inputs
compared to its alternatives, e.g. recurrent net-
works, and perform strongly on a range of se-
quence learning tasks, such as machine transla-
tion (Vaswani et al., 2017) and document gener-
ation (Liu et al., 2018). To use GPT on inputs
with multiple sequences during fine-tuning, GPT
applies task-specific input adaptations motivated
by traversal-style approaches (Rocktäschel et al.,
2015). These approaches pre-process each text
input as a single contiguous sequence of tokens
through special tokens including [START] (the
start of a sequence), [DELIM] (delimiting two se-
quences from the text input) and [EXTRACT] (the
end of a sequence). With a single pre-trained Transformer, GPT outperforms
task-specific architectures on 9 of the 12 tasks studied.
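As a toy sketch of these traversal-style adaptations (the special tokens come from the text above; the helper functions and example strings are illustrative only), different task inputs are linearised into a single token sequence:

```python
# Toy sketch of traversal-style input adaptations: every task input is
# linearised into one contiguous sequence delimited by the special tokens
# [START], [DELIM] and [EXTRACT]. Helper functions are illustrative only.
def entailment_input(premise: str, hypothesis: str) -> str:
    # Two sequences joined by a single delimiter token.
    return f"[START] {premise} [DELIM] {hypothesis} [EXTRACT]"

def multiple_choice_inputs(context: str, candidates: list) -> list:
    # One linearised sequence per candidate answer; the model scores each one.
    return [f"[START] {context} [DELIM] {c} [EXTRACT]" for c in candidates]

print(entailment_input("I am happy to join with you today.", "The speaker is pleased."))
```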
GPT2 (Radford et al., 2019) mainly follows the
architecture of GPT and trains a language model
on a dataset as large and diverse as possible to
learn from varied domains and contexts. To do
so,
Radford et al. (2019) create a new dataset of
millions of web pages named WebText, by scrap-
ing outbound links from Reddit. The authors ar-
gue that a language model trained on large-scale
unlabelled corpora begins to learn some common
supervised NLP tasks, such as question answer-
ing, machine translation and summarization, with-
out any explicit supervision signal. To validate
this, GPT2 is tested on ten datasets (e.g. Chil-
dren’s Book Test (Hill et al., 2015), LAMBADA
(Paperno et al., 2016) and CoQA (Reddy et al.,