optimization hyperparameters, given in Section 2,
except for the peak learning rate and number of
warmup steps, which are tuned separately for each
setting. We additionally found training to be very
sensitive to the Adam epsilon term, and in some
cases we obtained better performance or improved
stability after tuning it. Similarly, we found setting
β₂ = 0.98 to improve stability when training with
large batch sizes.
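For concreteness, the following is a minimal PyTorch sketch of this optimizer setup; the model, peak learning rate, warmup length, total steps, and decay shape are illustrative placeholders, since the peak learning rate and warmup steps are tuned separately for each setting.

import torch

# Sketch of the optimizer configuration described above (values are examples).
model = torch.nn.Linear(768, 768)  # stand-in for the pretraining model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,            # peak learning rate (tuned per setting)
    betas=(0.9, 0.98),  # beta_2 = 0.98 improves stability with large batches
    eps=1e-6,           # Adam epsilon; training can be sensitive to this term
    weight_decay=0.01,
)

warmup_steps, total_steps = 24_000, 500_000  # example values only

def lr_lambda(step: int) -> float:
    # Linear warmup to the peak learning rate, then linear decay (one common choice).
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)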
We pretrain with sequences of at most T = 512
tokens. Unlike Devlin et al. (2019), we do not randomly inject short sequences, and we do not train
with a reduced sequence length for the first 90% of
updates. We train only with full-length sequences.
We train with mixed precision floating point
arithmetic on DGX-1 machines, each with 8 ×
32GB Nvidia V100 GPUs interconnected by Infiniband (Micikevicius et al., 2018).
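For reference, a minimal sketch of one mixed-precision training step using PyTorch's torch.cuda.amp utilities is shown below; the model, optimizer, and batch names are assumptions, and this stands in for, rather than reproduces, the authors' training loop.

import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, input_ids, labels):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # run the forward pass in float16 where safe
        loss = model(input_ids, labels=labels).loss
    scaler.scale(loss).backward()        # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)               # unscale gradients, then apply the update
    scaler.update()
    return loss.item()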
3.2 Data
BERT-style pretraining crucially relies on large
quantities of text.
Baevski et al. (2019) demonstrate that increasing data size can result in improved end-task performance. Several efforts have trained on datasets larger and more diverse than the original BERT (Radford et al., 2019; Yang et al., 2019; Zellers et al., 2019). Unfortunately, not all of the additional datasets can be publicly released. For our study, we focus on gathering as much data as possible for experimentation, allowing us to match the overall quality and quantity of data as appropriate for each comparison.
We consider five English-language corpora of
varying sizes and domains, totaling over 160GB
of uncompressed text. We use the following text
corpora:
• BOOKCORPUS (Zhu et al., 2015) plus English WIKIPEDIA. This is the original data used to train BERT. (16GB).
• CC-NEWS, which we collected from the English portion of the CommonCrawl News dataset (Nagel, 2016). The data contains 63 million English news articles crawled between September 2016 and February 2019. (76GB after filtering). We use news-please (Hamborg et al., 2017) to collect and extract CC-NEWS; CC-NEWS is similar to the REALNEWS dataset described in Zellers et al. (2019).
• OPENWEBTEXT (Gokaslan and Cohen, 2019), an open-source recreation of the WebText corpus described in Radford et al. (2019). The text is web content extracted from URLs shared on Reddit with at least three upvotes. (38GB). The authors and their affiliated institutions are not in any way affiliated with the creation of the OpenWebText dataset.
• STORIES, a dataset introduced in Trinh and Le (2018), containing a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas. (31GB).
3.3 Evaluation
Following previous work, we evaluate our pretrained models on downstream tasks using the following three benchmarks.
GLUE The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019b) is a collection of 9 datasets for evaluating natural language understanding systems: CoLA (Warstadt et al., 2018), Stanford Sentiment Treebank (SST) (Socher et al., 2013), Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005), Semantic Textual Similarity Benchmark (STS) (Agirre et al., 2007), Quora Question Pairs (QQP) (Iyer et al., 2016), Multi-Genre NLI (MNLI) (Williams et al., 2018), Question NLI (QNLI) (Rajpurkar et al., 2016), Recognizing Textual Entailment (RTE) (Dagan et al., 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), and Winograd NLI (WNLI) (Levesque et al., 2011). Tasks are framed as either single-sentence classification or sentence-pair classification tasks. The GLUE organizers provide training and development data splits as well as a submission server and leaderboard that allow participants to evaluate and compare their systems on private held-out test data.
For the replication study in Section 4, we report results on the development sets after finetuning the pretrained models on the corresponding single-task training data (i.e., without multi-task training or ensembling). Our finetuning procedure follows the original BERT paper (Devlin et al., 2019).
In Section 5 we additionally report test set results obtained from the public leaderboard. These results depend on several task-specific modifications, which we describe in Section 5.1.
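A hedged sketch of what single-task finetuning on one GLUE task (RTE here) could look like, using the Hugging Face datasets and transformers libraries as stand-ins; the library choice, hyperparameter values, and task are assumptions for illustration, the paper's experiments use their own codebase, and the task-specific modifications of Section 5.1 are not shown.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

task = "rte"  # one of the 9 GLUE tasks
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def encode(batch):
    # RTE is a sentence-pair task; single-sentence tasks pass only one text field.
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, max_length=512)

data = load_dataset("glue", task).map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="rte-finetune",
                           per_device_train_batch_size=16,
                           learning_rate=2e-5,
                           num_train_epochs=10),
    train_dataset=data["train"],
    eval_dataset=data["validation"],  # GLUE test labels are private; report dev results
    tokenizer=tokenizer,              # enables dynamic padding during collation
)
trainer.train()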
SQuAD The Stanford Question Answering Dataset (SQuAD) provides a paragraph of context and a question. The task is to answer the question by extracting the relevant span from the context. We evaluate on two versions of SQuAD: V1.1 and V2.0 (Rajpurkar et al., 2016, 2018). In V1.1 the context always contains an answer, whereas in V2.0 some questions are not answered in the provided context, making the task more challenging.
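To make the span-extraction formulation concrete, below is a minimal sketch of selecting the best answer span from per-token start and end logits; the function name, tensor shapes, and maximum answer length are illustrative assumptions, not the authors' implementation.

import torch

def best_span(start_logits: torch.Tensor, end_logits: torch.Tensor,
              max_answer_len: int = 30) -> tuple[int, int]:
    """Return the (start, end) token indices of the highest-scoring valid span."""
    # Score every (start, end) pair as the sum of its start and end logits.
    scores = start_logits[:, None] + end_logits[None, :]
    # Mask spans that end before they start or exceed the maximum answer length.
    seq_len = start_logits.size(0)
    idx = torch.arange(seq_len)
    valid = (idx[None, :] >= idx[:, None]) & (idx[None, :] - idx[:, None] < max_answer_len)
    scores = scores.masked_fill(~valid, float("-inf"))
    flat = scores.argmax().item()
    return flat // seq_len, flat % seq_len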