BART: Denoising Sequence-to-Sequence Pre-training for Natural
Language Generation, Translation, and Comprehension
Mike Lewis*, Yinhan Liu*, Naman Goyal*, Marjan Ghazvininejad,
Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer
Facebook AI
{mikelewis,yinhanliu,naman}@fb.com
Abstract
We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.
1 Introduction
Self-supervised methods have achieved remarkable success in a wide range of NLP tasks (Mikolov et al., 2013; Peters et al., 2018; Devlin et al., 2019; Joshi et al., 2019; Yang et al., 2019; Liu et al., 2019). The most successful approaches have been variants of masked language models, which are denoising autoencoders that are trained to reconstruct text where a random subset of the words has been masked out. Recent work has shown gains by improving the distribution of masked tokens (Joshi et al., 2019), the order in which masked tokens are predicted (Yang et al., 2019), and the available context for replacing masked tokens (Dong et al., 2019). However, these methods typically focus on particular types of end tasks (e.g. span prediction, generation, etc.), limiting their applicability.
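To make the masked-language-model setup concrete, the minimal sketch below corrupts a token sequence BERT-style by masking a random subset of tokens; the function name and the 15% masking probability are illustrative choices for this sketch, not details taken from this section.

import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """BERT-style corruption: replace a random subset of tokens with a mask.

    Returns the corrupted sequence plus the reconstruction targets; positions
    that were not masked carry None and are ignored by the loss.
    """
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets.append(tok)    # the denoising objective: recover this token
        else:
            corrupted.append(tok)
            targets.append(None)   # not predicted
    return corrupted, targets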
In this paper, we present BART, which pre-trains a model combining Bidirectional and Auto-Regressive Transformers. BART is a denoising autoencoder built with a sequence-to-sequence model that is applicable to a very wide range of end tasks. Pretraining has two stages: (1) text is corrupted with an arbitrary noising function, and (2) a sequence-to-sequence model is learned to reconstruct the original text. BART uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes (see Figure 1).
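The two pretraining stages can be sketched as a single training step; this is an assumed interface for illustration, not the released BART code, with model (bidirectional encoder plus left-to-right decoder returning vocabulary logits), noise_fn, and encode all hypothetical.

import torch
import torch.nn.functional as F

def denoising_step(model, noise_fn, encode, text, optimizer):
    """One BART-style pretraining step (sketch only; interfaces are assumed).

    Stage (1): corrupt the text with an arbitrary noising function.
    Stage (2): train the seq2seq model to reconstruct the original text.
    """
    src = torch.tensor([encode(noise_fn(text))])   # corrupted input -> encoder
    tgt = torch.tensor([encode(text)])             # original text   -> decoder target

    # The bidirectional encoder reads the corrupted text; the left-to-right
    # decoder predicts each original token given the previous ones.
    logits = model(src, tgt[:, :-1])               # (batch, len-1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tgt[:, 1:].reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()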
A key advantage of this setup is the noising flexibility; arbitrary transformations can be applied to the original text, including changing its length. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where arbitrary-length spans of text (including zero length) are replaced with a single mask token. This approach generalizes the original word masking and next sentence prediction objectives in BERT by forcing the model to reason more about overall sentence length and make longer-range transformations to the input, as sketched below.
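The two best-performing transformations can be illustrated roughly as follows; the Poisson span lengths, mask budget, and span-start rate are stand-in parameters for this sketch rather than values quoted from this section.

import math
import random

MASK = "<mask>"

def permute_sentences(sentences, rng=random):
    """Sentence permutation: shuffle the original sentences into random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

def sample_poisson(lam, rng=random):
    """Tiny Poisson sampler (Knuth's method) so the sketch has no dependencies."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while p > threshold:
        k += 1
        p *= rng.random()
    return k - 1

def text_infilling(tokens, mask_ratio=0.3, lam=3.0, rng=random):
    """Text infilling: replace token spans (possibly of length zero) with a
    single mask token, so the model must also infer how many tokens are missing.
    """
    corrupted, i = [], 0
    budget = int(mask_ratio * len(tokens))
    masked = 0
    while i < len(tokens):
        if masked < budget and rng.random() < 0.2:
            span = min(sample_poisson(lam, rng), len(tokens) - i)
            corrupted.append(MASK)   # one mask token regardless of span length
            i += span                # span == 0 inserts a mask without removing text
            masked += span
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted

Because each span is collapsed to a single mask token, the model cannot read off how many tokens are missing and must predict the span length as part of reconstruction, which is what distinguishes this scheme from per-token masking.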
BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa (Liu et al., 2019) with comparable training resources on GLUE (Wang et al., 2018) and SQuAD (Rajpurkar et al., 2016), and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks. For example, it improves performance by 6 ROUGE over previous work on XSum (Narayan et al., 2018).
BART also opens up new ways of thinking about fine-tuning. We present a new scheme for machine translation where a BART model is stacked above a few additional transformer layers. These layers are trained to essentially translate the foreign language to noised