Word embedding vectors are an important component of natural language processing. Researchers from the University of Oxford, DeepMind, and elsewhere have recently written a survey of contextual embedding representations, detailing representative work on current pre-trained models.
arXiv:2003.07278v1 [cs.CL] 16 Mar 2020
A Survey on Contextual Embeddings
Qi Liu‡, Matt J. Kusner†∗, Phil Blunsom‡⋄
‡University of Oxford   ⋄DeepMind   †University College London   ∗The Alan Turing Institute
‡{firstname.lastname}@cs.ox.ac.uk   †m.kusner@ucl.ac.uk
Abstract
Contextual embeddings, such as ELMo and BERT, move beyond global word representations like Word2Vec and achieve ground-breaking performance on a wide range of natural language processing tasks. Contextual embeddings assign each word a representation based on its context, thereby capturing uses of words across varied contexts and encoding knowledge that transfers across languages. In this survey, we review existing contextual embedding models, cross-lingual polyglot pre-training, the application of contextual embeddings in downstream tasks, model compression, and model analyses.
1 Introduction
Distributional word representations (Turian et al., 2010; Mikolov et al., 2013; Pennington et al., 2014) trained in an unsupervised manner on large-scale corpora are widely used in modern natural language processing systems. However, these approaches only obtain a single global representation for each word, ignoring its context. Different from traditional word representations, contextual embeddings move beyond word-level semantics in that each token is associated with a representation that is a function of the entire input sequence. These context-dependent representations can capture many syntactic and semantic properties of words under diverse linguistic contexts. Previous work (Peters et al., 2018; Devlin et al., 2018; Yang et al., 2019; Raffel et al., 2019) has shown that contextual embeddings pre-trained on large-scale unlabelled corpora achieve state-of-the-art performance on a wide range of natural language processing tasks, such as text classification, question answering and text summarization. Further analyses (Liu et al., 2019a; Hewitt and Liang, 2019; Hewitt and Manning, 2019; Tenney et al., 2019a) demonstrate that contextual embeddings are capable of learning useful and transferable representations across languages.
The rest of the survey is organized as follows. In Section 2, we define the concept of contextual embeddings. In Section 3, we introduce existing methods for obtaining contextual embeddings. In Section 4, we present the pre-training methods of contextual embeddings on multi-lingual corpora. In Section 5, we describe methods for applying pre-trained contextual embeddings in downstream tasks. In Section 6, we detail model compression methods. In Section 7, we survey analyses that have aimed to identify the linguistic knowledge learned by contextual embeddings. We conclude the survey by highlighting some challenges for future research in Section 8.
2 Token Embeddings
Consider a text corpus that is represented as a sequence S of tokens, (t_1, t_2, ..., t_N). Distributed representations of words (Harris, 1954; Bengio et al., 2003) associate each token t_i with a dense feature vector h_{t_i}. Traditional word embedding techniques aim to learn a global word embedding matrix E ∈ R^{V×d}, where V is the vocabulary size and d is the number of dimensions. Specifically, each row e_i of E corresponds to the global embedding of word type i in the vocabulary V. Well-known models for learning word embeddings include Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). On the other hand, methods that learn contextual embeddings associate each token t_i with a representation that is a function of the entire input sequence S, i.e. h_{t_i} = f(e_{t_1}, e_{t_2}, ..., e_{t_N}), where each input token t_j is usually mapped to its non-contextualized representation e_{t_j} first, before applying an aggregation function f. These context-dependent representations are better suited to capture sequence-level semantics (e.g. polysemy) than non-contextual word embeddings. There are many model architectures for f, which we review here. We begin by describing pre-training methods for learning contextual embeddings that can be used in downstream tasks.
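To make the distinction concrete, below is a minimal sketch in plain Python/NumPy (toy vocabulary, random weights, and a single attention-like mixing step standing in for f; none of this corresponds to a specific published model). It shows that a global embedding matrix E gives "bank" the same vector in every sentence, whereas a contextual function of the whole sequence does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and global embedding matrix E in R^{V x d}.
vocab = {"the": 0, "bank": 1, "river": 2, "money": 3}
V, d = len(vocab), 8
E = rng.normal(size=(V, d))
W = rng.normal(size=(d, d))   # parameters of the stand-in aggregation function f

def global_embeddings(tokens):
    # Each word type always maps to the same row of E, regardless of context.
    return np.stack([E[vocab[t]] for t in tokens])

def contextual_embeddings(tokens):
    # Stand-in for f: one self-attention-like mixing step, so h_{t_i} depends
    # on every input token, i.e. h_{t_i} = f(e_{t_1}, ..., e_{t_N}).
    X = global_embeddings(tokens)                    # non-contextual e_{t_j}
    scores = X @ W @ X.T / np.sqrt(d)                # pairwise interactions
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ X

s1 = ["the", "river", "bank"]
s2 = ["the", "money", "bank"]
print(np.allclose(global_embeddings(s1)[2], global_embeddings(s2)[2]))          # True
print(np.allclose(contextual_embeddings(s1)[2], contextual_embeddings(s2)[2]))  # False
```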
3 Pre-training Methods for Contextual Embeddings
In large part, pre-training contextual embeddings can be divided into either unsupervised methods (e.g. language modelling and its variants) or supervised methods (e.g. machine translation and natural language inference).
3.1 Unsupervised Pre-training via Language Modeling
The prototypical way to learn distributed token embeddings is via language modelling. A language model is a probability distribution over a sequence of tokens. Given a sequence of N tokens, (t_1, t_2, ..., t_N), a language model factorizes the probability of the sequence as:

p(t_1, t_2, ..., t_N) = \prod_{i=1}^{N} p(t_i | t_1, t_2, ..., t_{i-1}).   (1)
Language modelling uses maximum likelihood estimation (MLE), often penalized with regularization terms, to estimate model parameters. A left-to-right language model takes the left context, t_1, t_2, ..., t_{i−1}, of t_i into account for estimating the conditional probability. Language models are usually trained using large-scale unlabelled corpora. The conditional probabilities are most commonly learned using neural networks (Bengio et al., 2003), and the learned representations have been proven to be transferable to downstream natural language understanding tasks (Dai and Le, 2015; Ramachandran et al., 2016).
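As an illustration of Eq. (1) and the MLE objective, the sketch below estimates a toy bigram language model by maximum likelihood from a tiny corpus and scores a sequence with the chain-rule factorization (plain Python; the bigram assumption truncates the left context to one token for brevity, whereas the neural models discussed here condition on the full prefix t_1, ..., t_{i-1}).

```python
import math
from collections import Counter, defaultdict

# A tiny unlabelled corpus; [START] plays the role of the sequence start.
corpus = [
    "[START] i am happy to join with you today".split(),
    "[START] i am happy to join you".split(),
    "[START] you are happy today".split(),
]

# MLE for a bigram model: p(t_i | t_{i-1}) = count(t_{i-1}, t_i) / count(t_{i-1}).
bigram, context = defaultdict(Counter), Counter()
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram[prev][cur] += 1
        context[prev] += 1

def log_prob(tokens):
    # Chain-rule factorization of Eq. (1): log p(t_1..t_N) = sum_i log p(t_i | context).
    return sum(
        math.log(bigram[prev][cur] / context[prev])
        for prev, cur in zip(tokens, tokens[1:])
    )

seq = "[START] i am happy to join you today".split()
print(log_prob(seq))                    # sequence log-likelihood under the toy model
print(-log_prob(seq) / (len(seq) - 1))  # per-token NLL; minimizing it is equivalent to MLE
```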
Precursor Models. Dai and Le (2015) is the first work we are aware of that uses language modelling together with a sequence autoencoder to improve sequence learning with recurrent networks. Thus, it can be thought of as a precursor to modern contextual embedding methods. Pre-trained on the datasets IMDB, Rotten Tomatoes, 20 Newsgroups, and DBpedia, the model is then fine-tuned on sentiment analysis and text classification tasks, achieving strong performance compared to randomly-initialized models.
Ramachandran et al. (2016) extend Dai and Le (2015) by proposing a pre-training method to improve the accuracy of sequence to sequence (seq2seq) models. The encoder and decoder of the seq2seq model are initialized with the pre-trained weights of two language models. These language models are separately trained on either the News Crawl English or German corpora for machine translation, while both are initialized with the language model trained on the English Gigaword corpus for abstractive summarization. These pre-trained models are fine-tuned on the WMT English → German task and the CNN/Daily Mail corpus, respectively, achieving better results over baselines without pre-training.
The work in the following sections improves over Dai and Le (2015) and Ramachandran et al. (2016) with new architectures (e.g. the Transformer), larger datasets, and new pre-training objectives. A summary of the models and the pre-training objectives is shown in Tables 1 and 2.
ELMo. The ELMo model (Peters et al., 2018) generalizes traditional word embeddings by extracting context-dependent representations from a bidirectional language model. A forward L-layer LSTM and a backward L-layer LSTM are applied to encode the left and right contexts, respectively. At each layer j, the contextualized representations are the concatenation of the left-to-right and right-to-left representations, obtaining N hidden representations, (h_{1,j}, h_{2,j}, ..., h_{N,j}), for a sequence of length N.
To use ELMo in downstream tasks, the (L + 1)-layer representations (including the global word embedding) for each token k are aggregated as:

ELMo_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j},   (2)

where s^{task} are layer-wise weights normalized by the softmax used to linearly combine the (L + 1)-layer representations of the token k, and \gamma^{task} is a task-specific constant.
Given a pre-trained ELMo, it is straightforward to incorporate it into a task-specific architecture for improving the performance. As most supervised models use global word representations x_k in their lowest layers, these representations can be concatenated with their corresponding context-dependent representations ELMo_k^{task}, obtaining [x_k ; ELMo_k^{task}], before feeding them to higher layers.
| Method | Architecture | Encoder | Decoder | Objective | Dataset |
| ELMo | LSTM | ✗ | ✓ | LM | 1B Word Benchmark |
| GPT | Transformer | ✗ | ✓ | LM | BookCorpus |
| GPT2 | Transformer | ✗ | ✓ | LM | Web pages starting from Reddit |
| BERT | Transformer | ✓ | ✗ | MLM & NSP | BookCorpus & Wiki |
| RoBERTa | Transformer | ✓ | ✗ | MLM | BookCorpus, Wiki, CC-News, OpenWebText, Stories |
| ALBERT | Transformer | ✓ | ✗ | MLM & SOP | Same as RoBERTa and XLNet |
| UniLM | Transformer | ✓ | ✗ | LM, MLM, seq2seq LM | Same as BERT |
| ELECTRA | Transformer | ✓ | ✗ | Discriminator (o/r) | Same as XLNet |
| XLNet | Transformer | ✗ | ✓ | PLM | BookCorpus, Wiki, Giga5, ClueWeb, Common Crawl |
| XLM | Transformer | ✓ | ✓ | CLM, MLM, TLM | Wiki, parallel corpora (e.g. MultiUN) |
| MASS | Transformer | ✓ | ✓ | Span Mask | WMT News Crawl |
| T5 | Transformer | ✓ | ✓ | Text Infilling | Colossal Clean Crawled Corpus |
| BART | Transformer | ✓ | ✓ | Text Infilling & Sent Shuffling | Same as RoBERTa |

Table 1: A comparison of popular pre-trained models.
| Objective | Inputs | Targets |
| LM | [START] I am happy to join with you | today |
| MLM | I am [MASK] to join with you [MASK] | happy today |
| NSP | Sent1 [SEP] Next Sent or Sent1 [SEP] Random Sent | Next Sent/Random Sent |
| SOP | Sent1 [SEP] Sent2 or Sent2 [SEP] Sent1 | in order/reversed |
| Discriminator (o/r) | I am thrilled to study with you today | o o r o r o o o |
| PLM | happy join with today | am I to you |
| seq2seq LM | I am happy to | join with you today |
| Span Mask | I am [MASK] [MASK] [MASK] with you today | happy to join |
| Text Infilling | I am [MASK] with you today | happy to join |
| Sent Shuffling | today you am I join with happy to | I am happy to join with you today |
| TLM | How [MASK] you [SEP] [MASK] vas-tu | are Comment |

Table 2: Pre-training objectives and their input-output formats.
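To make a row such as MLM in Table 2 concrete, here is a minimal sketch of building masked inputs and targets from a raw token sequence (plain Python with fixed mask positions chosen to reproduce the table row; BERT instead samples roughly 15% of positions at random and uses an 80/10/10 mask/random/keep scheme not shown here).

```python
def make_mlm_example(tokens, mask_positions):
    # Hide the chosen positions; the hidden tokens become the prediction targets.
    inputs = ["[MASK]" if i in mask_positions else tok for i, tok in enumerate(tokens)]
    targets = [tokens[i] for i in sorted(mask_positions)]
    return inputs, targets

sentence = "I am happy to join with you today".split()
inputs, targets = make_mlm_example(sentence, mask_positions={2, 7})
print(" ".join(inputs))    # I am [MASK] to join with you [MASK]
print(" ".join(targets))   # happy today
```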
The effectiveness of ELMo is evaluated on six NLP problems, including question answering, textual entailment and sentiment analysis.
GPT, GPT2, and Grover. GPT (Radford et al., 2018) adopts a two-stage learning paradigm: (a) unsupervised pre-training using a language modelling objective and (b) supervised fine-tuning. The goal is to learn universal representations transferable to a wide range of downstream tasks. To this end, GPT uses the BookCorpus dataset (Zhu et al., 2015), which contains more than 7,000 books from various genres, for training the language model. The Transformer architecture (Vaswani et al., 2017) is used to implement the language model, which has been shown to better capture global dependencies from the inputs compared to its alternatives, e.g. recurrent networks, and perform strongly on a range of sequence learning tasks, such as machine translation (Vaswani et al., 2017) and document generation (Liu et al., 2018). To use GPT on inputs with multiple sequences during fine-tuning, GPT applies task-specific input adaptations motivated by traversal-style approaches (Rocktäschel et al., 2015). These approaches pre-process each text input as a single contiguous sequence of tokens through special tokens including [START] (the start of a sequence), [DELIM] (delimiting two sequences from the text input) and [EXTRACT] (the end of a sequence). GPT outperforms task-specific architectures in 9 out of 12 tasks studied with a pre-trained Transformer.
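A minimal sketch of this traversal-style input adaptation (plain Python; the special-token names follow the paper, while the helper itself is hypothetical and ignores sub-word tokenization):

```python
def adapt_input(*segments):
    # Pack one or more text segments into a single contiguous token sequence:
    # [START] seg1 [DELIM] seg2 ... [EXTRACT], as GPT does during fine-tuning.
    tokens = ["[START]"]
    for i, segment in enumerate(segments):
        if i > 0:
            tokens.append("[DELIM]")
        tokens.extend(segment.split())
    tokens.append("[EXTRACT]")
    return tokens

# Single-sequence task (e.g. classification) vs. sequence-pair task (e.g. entailment).
print(adapt_input("the movie was great"))
print(adapt_input("a man is sleeping", "the man is awake"))
```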
GPT2 (Radford et al., 2019) mainly follows the architecture of GPT and trains a language model on a dataset as large and diverse as possible to learn from varied domains and contexts. To do so, Radford et al. (2019) create a new dataset of millions of web pages named WebText, by scraping outbound links from Reddit. The authors argue that a language model trained on large-scale unlabelled corpora begins to learn some common supervised NLP tasks, such as question answering, machine translation and summarization, without any explicit supervision signal. To validate this, GPT2 is tested on ten datasets (e.g. Children's Book Test (Hill et al., 2015), LAMBADA (Paperno et al., 2016) and CoQA (Reddy et al., 2019)).