Word embedding vectors are an important component of natural language processing. Researchers from the University of Oxford, DeepMind, and elsewhere have recently written a survey of contextual embedding representations, detailing representative work on current pre-trained models.
arXiv:2003.07278v1 [cs.CL] 16 Mar 2020
A Survey on Contextual Embeddings
Qi Liu‡, Matt J. Kusner†∗, Phil Blunsom‡⋄
‡University of Oxford   ⋄DeepMind   †University College London   ∗The Alan Turing Institute
‡{firstname.lastname}@cs.ox.ac.uk   †m.kusner@ucl.ac.uk
Abstract
Contextual embeddings, such as ELMo and BERT, move beyond global word representations like Word2Vec and achieve ground-breaking performance on a wide range of natural language processing tasks. Contextual embeddings assign each word a representation based on its context, thereby capturing uses of words across varied contexts and encoding knowledge that transfers across languages. In this survey, we review existing contextual embedding models, cross-lingual polyglot pre-training, the application of contextual embeddings in downstream tasks, model compression, and model analyses.
1 Introduction
Distributional word representations (Turian et al., 2010; Mikolov et al., 2013; Pennington et al., 2014) trained in an unsupervised manner on large-scale corpora are widely used in modern natural language processing systems. However, these approaches only obtain a single global representation for each word, ignoring its context. Different from traditional word representations, contextual embeddings move beyond word-level semantics in that each token is associated with a representation that is a function of the entire input sequence. These context-dependent representations can capture many syntactic and semantic properties of words under diverse linguistic contexts. Previous work (Peters et al., 2018; Devlin et al., 2018; Yang et al., 2019; Raffel et al., 2019) has shown that contextual embeddings pre-trained on large-scale unlabelled corpora achieve state-of-the-art performance on a wide range of natural language processing tasks, such as text classification, question answering and text summarization. Further analyses (Liu et al., 2019a; Hewitt and Liang, 2019; Hewitt and Manning, 2019; Tenney et al., 2019a) demonstrate that contextual embeddings are capable of learning useful and transferable representations across languages.
The rest of the survey is organized as follows. In Section 2, we define the concept of contextual embeddings. In Section 3, we introduce existing methods for obtaining contextual embeddings. In Section 4, we present the pre-training methods of contextual embeddings on multi-lingual corpora. In Section 5, we describe methods for applying pre-trained contextual embeddings in downstream tasks. In Section 6, we detail model compression methods. In Section 7, we survey analyses that have aimed to identify the linguistic knowledge learned by contextual embeddings. We conclude the survey by highlighting some challenges for future research in Section 8.
2 Token Embeddings
Consider a text corpus that is represented as a sequence S of tokens, (t_1, t_2, ..., t_N). Distributed representations of words (Harris, 1954; Bengio et al., 2003) associate each token t_i with a dense feature vector h_{t_i}. Traditional word embedding techniques aim to learn a global word embedding matrix E ∈ R^{V×d}, where V is the vocabulary size and d is the number of dimensions. Specifically, each row e_i of E corresponds to the global embedding of word type i in the vocabulary V. Well-known models for learning word embeddings include Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). On the other hand, methods that learn contextual embeddings associate each token t_i with a representation that is a function of the entire input sequence S, i.e. h_{t_i} = f(e_{t_1}, e_{t_2}, ..., e_{t_N}), where each input token t_j is usually mapped to its non-contextualized representation e_{t_j} first, before applying an aggregation function f. These context-dependent representations are better suited to capture sequence-level semantics (e.g. polysemy) than non-contextual word embeddings. There are many model architectures for f, which we review here. We begin by describing pre-training methods for learning contextual embeddings that can be used in downstream tasks.
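To make the distinction concrete, below is a minimal sketch in plain Python/NumPy (toy vocabulary, random weights, and a single attention-like mixing step standing in for f; none of this corresponds to a specific published model). It shows that a global embedding matrix E gives "bank" the same vector in every sentence, whereas a contextual function of the whole sequence does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and global embedding matrix E in R^{V x d}.
vocab = {"the": 0, "bank": 1, "river": 2, "money": 3}
V, d = len(vocab), 8
E = rng.normal(size=(V, d))
W = rng.normal(size=(d, d))   # parameters of the stand-in aggregation function f

def global_embeddings(tokens):
    # Each word type always maps to the same row of E, regardless of context.
    return np.stack([E[vocab[t]] for t in tokens])

def contextual_embeddings(tokens):
    # Stand-in for f: one self-attention-like mixing step, so h_{t_i} depends
    # on every input token, i.e. h_{t_i} = f(e_{t_1}, ..., e_{t_N}).
    X = global_embeddings(tokens)                    # non-contextual e_{t_j}
    scores = X @ W @ X.T / np.sqrt(d)                # pairwise interactions
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ X

s1 = ["the", "river", "bank"]
s2 = ["the", "money", "bank"]
print(np.allclose(global_embeddings(s1)[2], global_embeddings(s2)[2]))          # True
print(np.allclose(contextual_embeddings(s1)[2], contextual_embeddings(s2)[2]))  # False
```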
3 Pre-training Methods for Contextual Embeddings
In large part, pre-training contextual embeddings can be divided into either unsupervised methods (e.g. language modelling and its variants) or supervised methods (e.g. machine translation and natural language inference).
3.1 Unsupervised Pre-training via Language Modeling
The prototypical way to learn distributed token embeddings is via language modelling. A language model is a probability distribution over a sequence of tokens. Given a sequence of N tokens, (t_1, t_2, ..., t_N), a language model factorizes the probability of the sequence as:

p(t_1, t_2, ..., t_N) = \prod_{i=1}^{N} p(t_i | t_1, t_2, ..., t_{i-1}).   (1)
Language modelling uses maximum likelihood estimation (MLE), often penalized with regularization terms, to estimate model parameters. A left-to-right language model takes the left context, t_1, t_2, ..., t_{i−1}, of t_i into account for estimating the conditional probability. Language models are usually trained using large-scale unlabelled corpora. The conditional probabilities are most commonly learned using neural networks (Bengio et al., 2003), and the learned representations have been proven to be transferable to downstream natural language understanding tasks (Dai and Le, 2015; Ramachandran et al., 2016).
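As an illustration of Eq. (1) and the MLE objective, the sketch below estimates a toy bigram language model by maximum likelihood from a tiny corpus and scores a sequence with the chain-rule factorization (plain Python; the bigram assumption truncates the left context to one token for brevity, whereas the neural models discussed here condition on the full prefix t_1, ..., t_{i-1}).

```python
import math
from collections import Counter, defaultdict

# A tiny unlabelled corpus; [START] plays the role of the sequence start.
corpus = [
    "[START] i am happy to join with you today".split(),
    "[START] i am happy to join you".split(),
    "[START] you are happy today".split(),
]

# MLE for a bigram model: p(t_i | t_{i-1}) = count(t_{i-1}, t_i) / count(t_{i-1}).
bigram, context = defaultdict(Counter), Counter()
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram[prev][cur] += 1
        context[prev] += 1

def log_prob(tokens):
    # Chain-rule factorization of Eq. (1): log p(t_1..t_N) = sum_i log p(t_i | context).
    return sum(
        math.log(bigram[prev][cur] / context[prev])
        for prev, cur in zip(tokens, tokens[1:])
    )

seq = "[START] i am happy to join you today".split()
print(log_prob(seq))                    # sequence log-likelihood under the toy model
print(-log_prob(seq) / (len(seq) - 1))  # per-token NLL; minimizing it is equivalent to MLE
```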
Precursor Models. Dai and Le (2015) is the first work we are aware of that uses language modelling together with a sequence autoencoder to improve sequence learning with recurrent networks. Thus, it can be thought of as a precursor to modern contextual embedding methods. Pre-trained on the datasets IMDB, Rotten Tomatoes, 20 Newsgroups, and DBpedia, the model is then fine-tuned on sentiment analysis and text classification tasks, achieving strong performance compared to randomly-initialized models.
Ramachandran et al. (2016) extend Dai and Le (2015) by proposing a pre-training method to improve the accuracy of sequence to sequence (seq2seq) models. The encoder and decoder of the seq2seq model are initialized with the pre-trained weights of two language models. These language models are separately trained on either the News Crawl English or German corpora for machine translation, while both are initialized with the language model trained on the English Gigaword corpus for abstractive summarization. These pre-trained models are fine-tuned on the WMT English → German task and the CNN/Daily Mail corpus, respectively, achieving better results over baselines without pre-training.
The work in the following sections improves over Dai and Le (2015) and Ramachandran et al. (2016) with new architectures (e.g. the Transformer), larger datasets, and new pre-training objectives. A summary of the models and the pre-training objectives is shown in Tables 1 and 2.
ELMo. The ELMo model (Peters et al., 2018) generalizes traditional word embeddings by extracting context-dependent representations from a bidirectional language model. A forward L-layer LSTM and a backward L-layer LSTM are applied to encode the left and right contexts, respectively. At each layer j, the contextualized representations are the concatenation of the left-to-right and right-to-left representations, obtaining N hidden representations, (h_{1,j}, h_{2,j}, ..., h_{N,j}), for a sequence of length N.
To use ELMo in downstream tasks, the (L + 1)-layer representations (including the global word embedding) for each token k are aggregated as:

ELMo_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j},   (2)

where s^{task} are layer-wise weights normalized by the softmax used to linearly combine the (L + 1)-layer representations of the token k, and \gamma^{task} is a task-specific constant.
Given a pre-trained ELMo, it is straightforward to incorporate it into a task-specific architecture for improving the performance. As most supervised models use global word representations x_k in their lowest layers, these representations can be concatenated with their corresponding context-dependent representations ELMo_k^{task}, obtaining [x_k ; ELMo_k^{task}], before feeding them to higher layers.
| Method | Architecture | Encoder | Decoder | Objective | Dataset |
| ELMo | LSTM | ✗ | ✓ | LM | 1B Word Benchmark |
| GPT | Transformer | ✗ | ✓ | LM | BookCorpus |
| GPT2 | Transformer | ✗ | ✓ | LM | Web pages starting from Reddit |
| BERT | Transformer | ✓ | ✗ | MLM & NSP | BookCorpus & Wiki |
| RoBERTa | Transformer | ✓ | ✗ | MLM | BookCorpus, Wiki, CC-News, OpenWebText, Stories |
| ALBERT | Transformer | ✓ | ✗ | MLM & SOP | Same as RoBERTa and XLNet |
| UniLM | Transformer | ✓ | ✗ | LM, MLM, seq2seq LM | Same as BERT |
| ELECTRA | Transformer | ✓ | ✗ | Discriminator (o/r) | Same as XLNet |
| XLNet | Transformer | ✗ | ✓ | PLM | BookCorpus, Wiki, Giga5, ClueWeb, Common Crawl |
| XLM | Transformer | ✓ | ✓ | CLM, MLM, TLM | Wiki, parallel corpora (e.g. MultiUN) |
| MASS | Transformer | ✓ | ✓ | Span Mask | WMT News Crawl |
| T5 | Transformer | ✓ | ✓ | Text Infilling | Colossal Clean Crawled Corpus |
| BART | Transformer | ✓ | ✓ | Text Infilling & Sent Shuffling | Same as RoBERTa |

Table 1: A comparison of popular pre-trained models.
| Objective | Inputs | Targets |
| LM | [START] I am happy to join with you | today |
| MLM | I am [MASK] to join with you [MASK] | happy today |
| NSP | Sent1 [SEP] Next Sent or Sent1 [SEP] Random Sent | Next Sent/Random Sent |
| SOP | Sent1 [SEP] Sent2 or Sent2 [SEP] Sent1 | in order/reversed |
| Discriminator (o/r) | I am thrilled to study with you today | o o r o r o o o |
| PLM | happy join with today | am I to you |
| seq2seq LM | I am happy to | join with you today |
| Span Mask | I am [MASK] [MASK] [MASK] with you today | happy to join |
| Text Infilling | I am [MASK] with you today | happy to join |
| Sent Shuffling | today you am I join with happy to | I am happy to join with you today |
| TLM | How [MASK] you [SEP] [MASK] vas-tu | are Comment |

Table 2: Pre-training objectives and their input-output formats.
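To make a row such as MLM in Table 2 concrete, here is a minimal sketch of building masked inputs and targets from a raw token sequence (plain Python with fixed mask positions chosen to reproduce the table row; BERT instead samples roughly 15% of positions at random and uses an 80/10/10 mask/random/keep scheme not shown here).

```python
def make_mlm_example(tokens, mask_positions):
    # Hide the chosen positions; the hidden tokens become the prediction targets.
    inputs = ["[MASK]" if i in mask_positions else tok for i, tok in enumerate(tokens)]
    targets = [tokens[i] for i in sorted(mask_positions)]
    return inputs, targets

sentence = "I am happy to join with you today".split()
inputs, targets = make_mlm_example(sentence, mask_positions={2, 7})
print(" ".join(inputs))    # I am [MASK] to join with you [MASK]
print(" ".join(targets))   # happy today
```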
The effectiveness of ELMo is evaluated on six NLP problems, including question answering, textual entailment and sentiment analysis.
GPT, GPT2, and Grover. GPT (Radford et al., 2018) adopts a two-stage learning paradigm: (a) unsupervised pre-training using a language modelling objective and (b) supervised fine-tuning. The goal is to learn universal representations transferable to a wide range of downstream tasks. To this end, GPT uses the BookCorpus dataset (Zhu et al., 2015), which contains more than 7,000 books from various genres, for training the language model. The Transformer architecture (Vaswani et al., 2017) is used to implement the language model, which has been shown to better capture global dependencies from the inputs compared to its alternatives, e.g. recurrent networks, and perform strongly on a range of sequence learning tasks, such as machine translation (Vaswani et al., 2017) and document generation (Liu et al., 2018). To use GPT on inputs with multiple sequences during fine-tuning, GPT applies task-specific input adaptations motivated by traversal-style approaches (Rocktäschel et al., 2015). These approaches pre-process each text input as a single contiguous sequence of tokens through special tokens including [START] (the start of a sequence), [DELIM] (delimiting two sequences from the text input) and [EXTRACT] (the end of a sequence). GPT outperforms task-specific architectures in 9 out of 12 tasks studied with a pre-trained Transformer.
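A minimal sketch of this traversal-style input adaptation (plain Python; the special-token names follow the paper, while the helper itself is hypothetical and ignores sub-word tokenization):

```python
def adapt_input(*segments):
    # Pack one or more text segments into a single contiguous token sequence:
    # [START] seg1 [DELIM] seg2 ... [EXTRACT], as GPT does during fine-tuning.
    tokens = ["[START]"]
    for i, segment in enumerate(segments):
        if i > 0:
            tokens.append("[DELIM]")
        tokens.extend(segment.split())
    tokens.append("[EXTRACT]")
    return tokens

# Single-sequence task (e.g. classification) vs. sequence-pair task (e.g. entailment).
print(adapt_input("the movie was great"))
print(adapt_input("a man is sleeping", "the man is awake"))
```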
GPT2 (Radford et al., 2019) mainly follows the architecture of GPT and trains a language model on a dataset as large and diverse as possible to learn from varied domains and contexts. To do so, Radford et al. (2019) create a new dataset of millions of web pages named WebText, by scraping outbound links from Reddit. The authors argue that a language model trained on large-scale unlabelled corpora begins to learn some common supervised NLP tasks, such as question answering, machine translation and summarization, without any explicit supervision signal. To validate this, GPT2 is tested on ten datasets (e.g. Children's Book Test (Hill et al., 2015), LAMBADA (Paperno et al., 2016) and CoQA (Reddy et al., 2019)).