Pre-trained Models for Natural Language Processing: A Survey
Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai & Xuanjing Huang
School of Computer Science, Fudan University, Shanghai 200433, China;
Shanghai Key Laboratory of Intelligent Information Processing, Shanghai 200433, China
Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy with four perspectives. Next, we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions of PTMs for future research. This survey is intended to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.
Keywords: Deep Learning, Neural Network, Natural Language Processing, Pre-trained Model, Distributed Representation, Word Embedding, Self-Supervised Learning, Language Modelling
1 Introduction
With the development of deep learning, various neural networks have been widely used to solve Natural Language Processing (NLP) tasks, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), graph-based neural networks (GNNs), and attention mechanisms. One of the advantages of these neural models is their ability to alleviate the feature engineering problem. Non-neural NLP methods usually rely heavily on discrete handcrafted features, while neural methods usually use low-dimensional, dense vectors (aka. distributed representations) to implicitly represent the syntactic or semantic features of the language. These representations are learned in specific NLP tasks. Therefore, neural methods make it easy for people to develop various NLP systems.
Despite the success of neural models for NLP tasks, the performance improvement may be less significant compared to the Computer Vision (CV) field. The main reason is that current datasets for most supervised NLP tasks are rather small (except machine translation). Deep neural networks usually have a large number of parameters, which makes them overfit on such small training data and generalize poorly in practice. Therefore, the early neural models for many NLP tasks were relatively shallow and usually consisted of only 1∼3 neural layers.
Recently, substantial work has shown that pre-trained models (PTMs) on large corpora can learn universal language representations, which are beneficial for downstream NLP tasks and can avoid training a new model from scratch. With the development of computational power, the emergence of deep models (e.g., Transformer) and the constant enhancement of training skills, the architecture of PTMs has advanced from shallow to deep. The first-generation PTMs aim to learn good word embeddings. Since these models themselves are no longer needed by downstream tasks, they are usually very shallow for computational efficiency, such as Skip-Gram and GloVe. Although these pre-trained embeddings can capture semantic meanings of words, they are context-free and fail to capture higher-level concepts of text like syntactic structures, semantic roles, anaphora, etc.
* Corresponding author (email: firstname.lastname@example.org)
arXiv:2003.08271v1 [cs.CL] 18 Mar 2020
2 QIU XP, et al. Pre-trained Models for Natural Language Processing: A Survey March (2020)
The second-generation PTMs focus on learning contextual word embeddings, such as CoVe, ELMo, and BERT. These learned encoders are still needed by downstream tasks to represent words in context. Besides, various pre-training tasks have also been proposed to learn PTMs for different purposes.
The contributions of this survey can be summarized as follows:
Comprehensive review. We provide a comprehensive
review of PTMs for NLP, including background knowl-
edge, model architecture, pre-training tasks, various
extensions, adaption approaches, and applications. We
provide detailed descriptions of representative models,
make the necessary comparison, and summarise the
New taxonomy. We propose a taxonomy of PTMs for
NLP, which categorizes existing PTMs from four dif-
ferent perspectives: 1) type of word representation; 2)
architecture of PTMs; 3) type of pre-training tasks; 4)
extensions for speciﬁc types of scenarios or inputs.
Abundant resources. We collect abundant resources on
PTMs, including open-source systems, paper lists, etc.
Future directions. We discuss and analyze the limi-
tations of existing PTMs. Also, we suggest possible
future research directions.
The rest of the survey is organized as follows. Section 2
outlines the background concepts and commonly used nota-
tions of PTMs. Section 3 gives a brief overview of PTMs
and clariﬁes the categorization of PTMs. Section 4 provides
extensions of PTMs. Section 5 discusses how to transfer the
knowledge of PTMs to downstream tasks. Section 6 gives the
related resources on PTMs, including open-source systems,
paper lists, etc. Section 7 presents a collection of applications
across various NLP tasks. Section 8 discusses the current challenges and suggests future directions. Section 9 summarizes the paper.
2 Background
2.1 Language Representation Learning
As suggested by Bengio et al., a good representation should express general-purpose priors that are not task-specific but would be likely to be useful for a learning machine to solve AI tasks. When it comes to language, a good representation should capture the implicit linguistic rules and common-sense knowledge hidden in text data, such as lexical meanings, syntactic structures, semantic roles, and even pragmatics.
The core idea of distributed representation is to describe the meaning of a piece of text by low-dimensional real-valued vectors. No single dimension of the vector has a corresponding sense, while the whole represents a concrete concept. Figure 1 illustrates the generic neural architecture for NLP. There are two kinds of word embeddings: non-contextual and contextual embeddings. The difference between them is whether the embedding for a word dynamically changes according to the context it appears in.
Figure 1: Generic Neural Architecture for NLP.
Non-contextual Embeddings. To represent language, the first step is to map discrete language symbols into a distributed embedding space. Formally, for each word (or sub-word) $x$ in a vocabulary $\mathcal{V}$, we map it to a vector $\mathbf{e}_x \in \mathbb{R}^{D_e}$ with a lookup table $\mathbf{E} \in \mathbb{R}^{D_e \times |\mathcal{V}|}$, where $D_e$ is a hyper-parameter indicating the dimension of token embeddings. These embeddings are trained on task data along with other model parameters.
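The lookup described above can be sketched in a few lines of Python. This is a minimal illustration rather than the survey's implementation: the toy vocabulary, the `<unk>` fallback token, and the function names `build_lookup_table` and `embed` are assumptions introduced here, and the vectors are randomly initialised instead of trained.

```python
import random

def build_lookup_table(vocab, dim, seed=0):
    """Map each token to a randomly initialised vector of length `dim`."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for tok in vocab}

def embed(tokens, table, unk="<unk>"):
    """Look up each token; tokens missing from the table fall back to `unk`."""
    return [table.get(tok, table[unk]) for tok in tokens]

# Hypothetical toy vocabulary, for illustration only.
vocab = ["<unk>", "the", "bank", "river", "money"]
E = build_lookup_table(vocab, dim=4)
vectors = embed(["the", "river", "bank"], E)
```

Because the table is indexed by the token alone, `embed(["river", "bank"], E)` and `embed(["money", "bank"], E)` return the same vector for "bank", whatever its neighbours are.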
There are two main limitations to this kind of embedding. The first issue is that the embeddings are static: the embedding for a word is always the same regardless of its context. Therefore, these non-contextual embeddings fail to model polysemous words. The second issue is the out-of-vocabulary problem. To tackle this problem, character-level word representations or sub-word representations are widely used in many NLP tasks, such as CharCNN, FastText, and Byte-Pair Encoding (BPE).
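To make the sub-word idea concrete, the following is a small sketch of how BPE merge rules can be learned from word counts. It follows the commonly described procedure of repeatedly merging the most frequent adjacent symbol pair. The end-of-word marker `</w>`, the toy word counts, and all function names are assumptions made for this illustration.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_counts, num_merges):
    """Return the learned merge rules and the segmented vocabulary."""
    # Start from characters plus an end-of-word marker.
    word_freqs = {tuple(w) + ("</w>",): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs

# Hypothetical toy counts, for illustration only.
merges, segmented = learn_bpe(
    {"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3
)
```

A word never seen during training can still be segmented into known sub-word units by applying the learned merges in order, which is how this family of methods sidesteps the out-of-vocabulary problem.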
Contextual Embeddings. To address the issue of polysemy and the context-dependent nature of words, we need to distinguish the semantics of words in different contexts. Given a text $x_1, x_2, \cdots, x_T$ where each token $x_t$ is a word or sub-word, the contextual representation of $x_t$ depends on the whole text:
$$[\mathbf{h}_1, \mathbf{h}_2, \cdots, \mathbf{h}_T] = f_{\mathrm{enc}}(x_1, x_2, \cdots, x_T),$$
where $f_{\mathrm{enc}}(\cdot)$ is a neural encoder, which is described in Section 2.2, and $\mathbf{h}_t$ is called the contextual embedding or dynamical embedding of token $x_t$ because of the contextual information it contains.
Figure 2: Neural Contextual Encoders. (a) Convolutional model; (b) Sequential model; (c) Fully-connected graph-based model.
2.2 Neural Contextual Encoders
Most of the neural contextual encoders can be classified into three categories: convolutional models, sequential models, and graph-based models. Figure 2 illustrates the architecture of these models.
(1) Convolutional models. Convolutional models take the embeddings of words in the input sentence and capture the meaning of a word by aggregating the local information from its neighbors by convolution operations.
Convolutional models are usually easy to train and can capture the local contextual information.
(2) Sequential models. Sequential models usually adopt RNNs (such as LSTM and GRU) to capture the contextual representation of a word. In practice, bi-directional RNNs are used to collect information from both sides of a word, but their performance is often affected by the long-term dependency problem.
(3) Graph-based models. Different from the above models, graph-based models take words as nodes and learn the contextual representation with a pre-defined linguistic structure between words, such as the syntactic structure or semantic relation.
Although the linguistic-aware graph structure can provide
useful inductive bias, how to build a good graph structure is
also a challenging problem. Besides, the structure depends
heavily on expert knowledge or external NLP tools, such as
the dependency parser.
In practice, a more straightforward way is to use a fully-connected graph to model the relation of every two words and let the model learn the structure by itself. Usually, the connection weights are dynamically computed by the self-attention mechanism, which implicitly indicates the connection between words.
A successful implementation of such an idea is the Transformer, which adopts the fully-connected self-attention architecture as well as other useful designs, such as positional embeddings, layer normalization, and residual connections.
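The fully-connected view can be illustrated with a stripped-down self-attention step in plain Python. This sketch makes several simplifying assumptions that are not part of the text above: it uses a single head, identity projections (queries, keys, and values are all the input vectors themselves), and omits positional embeddings, layer normalization, and residual connections. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X.

    X is a list of T vectors of length d. Returns the T output vectors
    and the T x T matrix of connection weights.
    """
    d = len(X[0])
    outputs, weights = [], []
    for q in X:
        # Score every position against the current query position.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all input vectors.
        outputs.append([sum(w_j * v[i] for w_j, v in zip(w, X)) for i in range(d)])
    return outputs, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
H, W = self_attention(X)
```

Each row of `W` sums to one and plays the role of the dynamically computed edge weights of the fully-connected graph: every position attends to every other position in a single step.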
Both convolutional and sequential models learn the contextual representation of a word with locality bias and have difficulty capturing long-range interactions between words. In contrast, Transformer can directly model the dependency between every two words in a sequence, which is more powerful and suitable for modeling language.
However, due to its heavy structure and less model bias, Transformer usually requires a large training corpus and easily overfits on small or modestly-sized datasets [130, 49].
2.3 Why Pre-training?
With the development of deep learning, the number of model parameters has increased rapidly. A much larger dataset is needed to fully train model parameters and prevent overfitting. However, building large-scale labeled datasets is a great challenge for most NLP tasks due to the extremely expensive annotation costs, especially for syntax and semantically related tasks.
In contrast, large-scale unlabeled corpora are relatively easy
to construct. To leverage the huge unlabeled text data, we can
ﬁrst learn a good representation from them and then use these
representations for other tasks. Recent studies have demon-
strated signiﬁcant performance gains on many NLP tasks with
the help of the representation extracted from the PTMs on the
large unannotated corpora.
The advantages of pre-training can be summarized as follows:
Pre-training on a huge text corpus can learn universal language representations and help with the downstream tasks.
Pre-training provides a better model initialization, which usually leads to better generalization performance and speeds up convergence on the target task.
Pre-training can be regarded as a kind of regularization to avoid overfitting on small data.
2.4 A Brief History of PTMs for NLP
Pre-training has always been an effective strategy to learn the parameters of deep neural networks, which are then fine-tuned on downstream tasks. As early as 2006, the breakthrough of deep learning came with greedy layer-wise unsupervised pre-training followed by supervised fine-tuning. In CV, it has been common practice to pre-train models on the huge ImageNet corpus and then fine-tune them on smaller data for different tasks. This is much better than random initialization because the model learns general image features, which can then be used in various vision tasks.
In NLP, PTMs on large corpora have also been proved to be beneficial for downstream NLP tasks, from shallow word embeddings to deep neural models.
2.4.1 Pre-trained word embeddings
Representing words as dense vectors has a long history. The “modern” word embedding was introduced in the pioneering work on the neural network language model (NNLM). Collobert et al. showed that pre-trained word embeddings learned on unlabelled data can significantly improve many NLP tasks. To address the computational complexity, they learned word embeddings with a pairwise ranking task instead of language modeling. Their work is the first attempt to obtain generic word embeddings useful for other tasks from unlabeled data.
Mikolov et al. showed that there is no need for deep neural networks to build good word embeddings. They proposed two shallow architectures: Continuous Bag-of-Words (CBOW) and Skip-Gram (SG) models. Although the proposed models are simple and shallow, they can still learn effective word embeddings capturing the latent syntactic and semantic similarities. Word2vec is one of the most popular implementations of these models and makes the pre-trained word embeddings accessible for different tasks in NLP. Besides, GloVe is also a widely-used model for obtaining pre-trained word embeddings, which are computed from global word-word co-occurrence statistics of a corpus.
Although pre-trained word embeddings have been shown to be effective in NLP tasks, they are context-independent and mostly trained by shallow models. When used in a downstream task, the rest of the whole model still needs to be learned from scratch.
During the same time period, many researchers also tried to learn embeddings of paragraphs, sentences, or documents, such as paragraph vector, Skip-thought vectors, and so on. Different from their modern successors, these sentence embedding models try to encode input sentences into a fixed-dimensional vector representation, rather than the contextual representation for each token.
2.4.2 Pre-trained contextual encoders
Since most NLP tasks are beyond word-level, it is natural to
pre-train the neural encoders on sentence-level or higher. The
output vectors of neural encoders are also called contextual
word embeddings since they represent the word semantics
depending on its context.
McCann et al. pre-trained a deep LSTM encoder from an attentional sequence-to-sequence model with machine translation (MT). The context vectors (CoVe) output by the pretrained encoder can improve the performance of a wide variety of common NLP tasks. Peters et al. pre-trained a 2-layer LSTM encoder with a bidirectional language model (BiLM), consisting of a forward LM and a backward LM. The contextual representations output by the pre-trained BiLM, ELMo (Embeddings from Language Models), are shown to bring large improvements on a broad range of NLP tasks. Akbik et al. captured word meaning with contextual string embeddings pre-trained with a character-level LM.
However, these PTMs are usually used as feature extractors to produce the contextual word embeddings, which are fed into the main model for downstream tasks. Their parameters are fixed, and the remaining parameters of the main model are still trained from scratch.
Ramachandran et al. found that seq2seq models can be significantly improved by unsupervised pre-training. The weights of both encoder and decoder are initialized with pre-trained weights of two language models and then fine-tuned with labeled data. ULMFiT (Universal Language Model Fine-tuning) attempted to fine-tune a pre-trained LM for text classification (TC) and achieved state-of-the-art results on six widely-used TC datasets. ULMFiT consists of 3 phases: 1) pre-training the LM on general-domain data; 2) fine-tuning the LM on target data; 3) fine-tuning on the target task. ULMFiT also investigates some effective fine-tuning strategies, including discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing. Since ULMFiT, fine-tuning has become the mainstream approach to adapt PTMs for the downstream tasks.
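As an illustration of one of the fine-tuning strategies named above, the slanted triangular learning rate can be sketched as a short warm-up followed by a long linear decay. The schedule below follows the form given in the ULMFiT paper as best recalled here; the default constants (`lr_max=0.01`, `cut_frac=0.1`, `ratio=32`) and the function name are assumptions for this sketch and should be checked against the original paper before use.

```python
import math

def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t (0 <= t <= total_steps).

    cut_frac: fraction of steps spent increasing the learning rate.
    ratio:    how much smaller the lowest rate is than lr_max.
    """
    cut = max(1, math.floor(total_steps * cut_frac))  # guard against cut == 0
    if t < cut:
        p = t / cut                                   # linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio

schedule = [slanted_triangular_lr(t, 1000) for t in range(0, 1001, 100)]
```

The rate starts at `lr_max / ratio`, peaks at `lr_max` after the first `cut_frac` of training, and then decays back towards `lr_max / ratio`, so the model converges quickly to a suitable region and is then refined slowly.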
More recently, very deep PTMs have shown their powerful ability in learning universal language representations: e.g., OpenAI GPT (Generative Pre-training) and BERT (Bidirectional Encoder Representation from Transformer). Besides LM, an increasing number of self-supervised tasks (see Section 3.1) have been proposed to make PTMs capture more knowledge from large-scale text corpora.
3 Overview of PTMs
The major differences between PTMs are the usages of contextual encoders, pre-training tasks, and purposes. We have briefly introduced the architectures of contextual encoders in Section 2.2. In this section, we focus on the description of pre-training tasks and give a taxonomy of PTMs.
3.1 Pre-training Tasks
The pre-training tasks are crucial for learning the universal representation of language. Usually, these pre-training tasks should be challenging and have substantial training data. In this section, we summarize the pre-training tasks into three categories: supervised learning, unsupervised learning, and self-supervised learning.
Supervised learning is to learn a function that maps an input to an output based on training data consisting of input-output pairs.
Unsupervised learning is to find some intrinsic knowledge, such as clusters, densities, latent representations, from unlabeled data.
Self-Supervised learning (SSL) is a blend of supervised learning and unsupervised learning. The key idea of SSL is to predict any part of the input from other parts in some form. For example, the masked language model (MLM) is a self-supervised task that attempts to predict the masked words in a sentence given the remaining words.
In CV, many PTMs are trained on large supervised training sets like ImageNet. However, in the NLP field, the datasets of most supervised tasks are not large enough to train a good PTM. The only exception is machine translation (MT). A large-scale MT dataset, WMT 2017, consists of more than 7 million sentence pairs. Besides, MT is one of the most challenging tasks in NLP, and an encoder pretrained on MT can benefit a variety of downstream NLP tasks. As a successful PTM, CoVe is an encoder pretrained on the MT task and improves a wide variety of common NLP tasks: sentiment analysis (SST, IMDb), question classification (TREC), entailment (SNLI), and question answering (SQuAD).
The pre-training tasks widely-used in existing PTMs are
listed as follows:
3.1.1 Language Modeling (LM)
The most common unsupervised task in NLP is probabilistic
language modeling (LM), which is a classic probabilistic den-
sity estimation problem. Although LM is a general concept,
in practice, LM often refers in particular to auto-regressive
LM or unidirectional LM.
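Given a text sequence $x_{1:T} = [x_1, x_2, \cdots, x_T]$, its joint probability $p(x_{1:T})$ can be decomposed as
$$p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{0:t-1}),$$
where $x_0$ is a special token indicating the beginning of the sequence.
The conditional probability $p(x_t \mid x_{0:t-1})$ can be modeled by a probability distribution over the vocabulary given the linguistic context $x_{0:t-1}$. The context $x_{0:t-1}$ is modeled by a neural encoder $f_{\mathrm{enc}}(\cdot)$, and the conditional probability is
$$p(x_t \mid x_{0:t-1}) = g_{\mathrm{LM}}\big(f_{\mathrm{enc}}(x_{0:t-1})\big),$$
where $g_{\mathrm{LM}}(\cdot)$ is the prediction layer.
Given a huge corpus, we can train the entire network with maximum likelihood estimation (MLE).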
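The chain-rule factorization and MLE training can be illustrated with a deliberately tiny model. The sketch below is not a neural LM: as a simplifying assumption it conditions each token only on the previous one (a bigram model), so that the maximum-likelihood estimates reduce to normalized counts. The toy corpus, the `<s>` start token, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

BOS = "<s>"  # stands in for the special start token x_0

def train_bigram_lm(corpus):
    """MLE estimate: p(x_t | x_{t-1}) = count(x_{t-1}, x_t) / count(x_{t-1})."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        prev = BOS
        for tok in sentence:
            counts[prev][tok] += 1
            prev = tok
    return {
        prev: {tok: c / sum(nxt.values()) for tok, c in nxt.items()}
        for prev, nxt in counts.items()
    }

def sequence_log_prob(sentence, model):
    """Sum of log conditionals, i.e. the log of the chain-rule product."""
    logp, prev = 0.0, BOS
    for tok in sentence:
        p = model.get(prev, {}).get(tok, 0.0)
        if p == 0.0:
            return float("-inf")  # unseen transition under plain MLE
        logp += math.log(p)
        prev = tok
    return logp

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
lm = train_bigram_lm(corpus)
log_p = sequence_log_prob(["the", "cat", "sat"], lm)
```

A neural LM replaces the count table with an encoder and a prediction layer and conditions on the full left context, but the training objective, maximizing the summed log conditionals over the corpus, has the same form.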
A drawback of unidirectional LM is that the representation
of each token encodes only the leftward context tokens and it-
self. However, better contextual representations of text should
encode contextual information from both directions. An im-
proved solution is bidirectional LM (BiLM), which consists
of two unidirectional LMs: a forward left-to-right LM and a
backward right-to-left LM.
For BiLM, Baevski et al. proposed a two-tower model in which the forward tower operates the left-to-right LM and the backward tower operates the right-to-left LM.
3.1.2 Masked Language Modeling (MLM)
Masked language modeling (MLM) was first proposed by Taylor in the literature, who referred to it as a Cloze task. Devlin et al. adapted this task as a novel pre-training task to overcome the drawback of the standard unidirectional LM. Loosely speaking, MLM first masks out some tokens from the input sentences and then trains the model to predict the masked tokens from the rest of the tokens. However, this pre-training method creates a mismatch between the pre-training phase and the fine-tuning phase, because the mask token does not appear during the fine-tuning phase. Empirically, to deal with this issue, Devlin et al. used a special [MASK] token 80% of the time, a random token 10% of the time, and the original token 10% of the time to perform masking.
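The 80%/10%/10% replacement rule can be sketched as follows. The per-token selection rate `mask_prob=0.15` is an assumption taken from common BERT practice and is not stated in the text above; the function name, the `None` convention for unselected positions, and the toy vocabulary are likewise illustrative.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None
    elsewhere, so the loss is computed only on selected positions.
    """
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # position not selected
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

vocab = ["the", "cat", "sat", "on", "mat"]
inputs, labels = mask_tokens(["the", "cat", "sat", "on", "the", "mat"], vocab, seed=0)
```

Keeping some selected tokens unchanged or randomly replaced means the encoder cannot rely on seeing `[MASK]` to know which positions matter, which reduces the mismatch with fine-tuning inputs.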
Sequence-to-Sequence MLM (Seq2Seq MLM). MLM is usually solved as a classification problem. We feed the masked sequences to a neural encoder, whose output vectors are further fed into a softmax classifier to predict the masked token. Alternatively, we can use an encoder-decoder (aka. sequence-to-sequence) architecture for MLM, in which the encoder is fed a masked sequence and the decoder sequentially produces the masked tokens in an auto-regressive fashion. We refer to this kind of MLM as sequence-to-sequence MLM (Seq2Seq MLM), which is used in MASS and T5. Seq2Seq MLM can benefit Seq2Seq-style downstream tasks, such as question answering, summarization, and machine translation.
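The data preparation for the encoder-decoder variant can be sketched as below. This is a simplified illustration loosely in the spirit of span masking: a contiguous span is masked on the encoder side and the decoder is trained to emit that span token by token. The span choice, the `<s>` decoder start token, and the function name are assumptions made here and do not reproduce the exact recipe of MASS or T5.

```python
def make_seq2seq_mlm_example(tokens, start, length, mask="[MASK]", bos="<s>"):
    """Build one training example by masking tokens[start:start + length].

    Returns (encoder_input, decoder_input, target):
      encoder_input: the sequence with the span replaced by mask tokens
      decoder_input: the target shifted right (teacher forcing)
      target:        the masked-out tokens, in order
    """
    end = start + length
    if not (0 <= start < end <= len(tokens)):
        raise ValueError("span must lie inside the sequence")
    encoder_input = tokens[:start] + [mask] * length + tokens[end:]
    target = tokens[start:end]
    decoder_input = [bos] + target[:-1]
    return encoder_input, decoder_input, target

tokens = "the quick brown fox jumps".split()
enc_in, dec_in, tgt = make_seq2seq_mlm_example(tokens, start=1, length=2)
```

At each decoding step the model sees the masked sequence through the encoder and the previously generated span tokens through the decoder, which matches the auto-regressive generation pattern of Seq2Seq-style downstream tasks.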
Indeed, it is hard to clearly distinguish unsupervised learning from self-supervised learning. For clarification, we use “unsupervised learning” to refer to learning without human-annotated supervised labels.