Input/Output Representations To make BERT
handle a variety of down-stream tasks, our input
representation is able to unambiguously represent
both a single sentence and a pair of sentences
(e.g., ⟨Question, Answer⟩) in one token sequence.
Throughout this work, a “sentence” can be an arbi-
trary span of contiguous text, rather than an actual
linguistic sentence. A “sequence” refers to the in-
put token sequence to BERT, which may be a sin-
gle sentence or two sentences packed together.
We use WordPiece embeddings (Wu et al.,
2016) with a 30,000 token vocabulary. The first
token of every sequence is always a special clas-
sification token ([CLS]). The final hidden state
corresponding to this token is used as the ag-
gregate sequence representation for classification
tasks. Sentence pairs are packed together into a
single sequence. We differentiate the sentences in
two ways. First, we separate them with a special
token ([SEP]). Second, we add a learned embed-
ding to every token indicating whether it belongs
to sentence A or sentence B. As shown in Figure 1,
we denote the input embedding as E, the final hidden
vector of the special [CLS] token as C ∈ R^H, and
the final hidden vector for the i-th input token as
T_i ∈ R^H.
For a given token, its input representation is
constructed by summing the corresponding token,
segment, and position embeddings. A visualiza-
tion of this construction can be seen in Figure 2.
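For concreteness, the following is a minimal NumPy sketch of this construction, not the released implementation; the build_input helper, the embedding-table stand-ins, and the [CLS]/[SEP] ids are assumptions for illustration, with H = 768 taken from the BERT-Base configuration and the 30,000-entry vocabulary from the WordPiece setting above.

import numpy as np

# Sketch-only hyperparameters: hidden size, vocabulary size, maximum length.
H, VOCAB, MAX_LEN = 768, 30000, 512
rng = np.random.default_rng(0)

# Stand-ins for the learned token, segment (A/B), and position embedding tables.
token_emb = rng.normal(0.0, 0.02, (VOCAB, H))
segment_emb = rng.normal(0.0, 0.02, (2, H))
position_emb = rng.normal(0.0, 0.02, (MAX_LEN, H))

CLS_ID, SEP_ID = 101, 102  # illustrative WordPiece ids for [CLS] and [SEP]

def build_input(tokens_a, tokens_b=None):
    """Pack one sentence (or a sentence pair) as [CLS] A [SEP] (B [SEP]) and
    sum the token, segment, and position embeddings at each position."""
    ids = [CLS_ID] + tokens_a + [SEP_ID]
    segments = [0] * len(ids)                      # sentence A
    if tokens_b is not None:
        ids += tokens_b + [SEP_ID]
        segments += [1] * (len(tokens_b) + 1)      # sentence B
    positions = list(range(len(ids)))
    return token_emb[ids] + segment_emb[segments] + position_emb[positions]

# Example: a hypothetical question/answer pair of WordPiece ids.
embeddings = build_input([2054, 2003, 14324], [1037, 2653, 2944])
print(embeddings.shape)  # (9, 768)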
3.1 Pre-training BERT
Unlike Peters et al. (2018a) and Radford et al.
(2018), we do not use traditional left-to-right or
right-to-left language models to pre-train BERT.
Instead, we pre-train BERT using two unsuper-
vised tasks, described in this section. This step
is presented in the left part of Figure 1.
Task #1: Masked LM Intuitively, it is reason-
able to believe that a deep bidirectional model is
strictly more powerful than either a left-to-right
model or the shallow concatenation of a left-to-
right and a right-to-left model. Unfortunately,
standard conditional language models can only be
trained left-to-right or right-to-left, since bidirec-
tional conditioning would allow each word to in-
directly “see itself”, and the model could trivially
predict the target word in a multi-layered context.
In order to train a deep bidirectional representa-
tion, we simply mask some percentage of the input
tokens at random, and then predict those masked
tokens. We refer to this procedure as a “masked
LM” (MLM), although it is often referred to as a
Cloze task in the literature (Taylor, 1953). In this
case, the final hidden vectors corresponding to the
mask tokens are fed into an output softmax over
the vocabulary, as in a standard LM. In all of our
experiments, we mask 15% of all WordPiece to-
kens in each sequence at random. In contrast to
denoising auto-encoders (Vincent et al., 2008), we
only predict the masked words rather than recon-
structing the entire input.
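As a rough sketch of that prediction step (an assumption for illustration, not the paper's code): the final hidden vectors are projected onto the vocabulary and only the masked positions contribute to the cross-entropy loss. Here targets[i] is assumed to hold the original token id at predicted positions and -1 elsewhere.

import numpy as np

def mlm_loss(final_hidden, output_weights, targets):
    """Softmax over the vocabulary at every position, but average the
    cross-entropy only over masked (predicted) positions."""
    logits = final_hidden @ output_weights.T              # [seq_len, vocab]
    logits -= logits.max(axis=-1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    losses = [-log_probs[i, t] for i, t in enumerate(targets) if t >= 0]
    return float(np.mean(losses)) if losses else 0.0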
Although this allows us to obtain a bidirec-
tional pre-trained model, a downside is that we
are creating a mismatch between pre-training and
fine-tuning, since the [MASK] token does not ap-
pear during fine-tuning. To mitigate this, we do
not always replace “masked” words with the ac-
tual [MASK] token. The training data generator
chooses 15% of the token positions at random for
prediction. If the i-th token is chosen, we replace
the i-th token with (1) the [MASK] token 80% of
the time (2) a random token 10% of the time (3)
the unchanged i-th token 10% of the time. Then,
T_i will be used to predict the original token with
cross entropy loss. We compare variations of this
procedure in Appendix C.2.
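A minimal sketch of such a data generator follows (assumed, not the authors' code); it approximates the 15% selection by sampling each position independently, and MASK_ID is an illustrative placeholder for the [MASK] token id.

import random

MASK_ID = 103  # illustrative id for the [MASK] token

def mask_tokens(token_ids, vocab_size=30000, select_prob=0.15):
    """Select ~15% of positions for prediction, then apply the 80/10/10
    [MASK] / random-token / keep-unchanged replacement rule.
    Returns the corrupted inputs and per-position targets (-1 = not predicted)."""
    inputs = list(token_ids)
    targets = [-1] * len(inputs)
    for i in range(len(inputs)):
        if random.random() < select_prob:
            targets[i] = inputs[i]                # original token is the label
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID               # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the unchanged i-th token
    return inputs, targets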
Task #2: Next Sentence Prediction (NSP)
Many important downstream tasks such as Ques-
tion Answering (QA) and Natural Language Infer-
ence (NLI) are based on understanding the rela-
tionship between two sentences, which is not di-
rectly captured by language modeling. In order
to train a model that understands sentence rela-
tionships, we pre-train for a binarized next sen-
tence prediction task that can be trivially gener-
ated from any monolingual corpus. Specifically,
when choosing the sentences A and B for each pre-
training example, 50% of the time B is the actual
next sentence that follows A (labeled as IsNext),
and 50% of the time it is a random sentence from
the corpus (labeled as NotNext). As we show
in Figure 1, C is used for next sentence prediction
(NSP).^5 Despite its simplicity, we demonstrate in
Section 5.1 that pre-training towards this task is
very beneficial to both QA and NLI.^6
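A sketch of how such pairs might be drawn from a monolingual corpus (assumed for illustration; corpus is a hypothetical list of documents, each a list of sentences):

import random

def make_nsp_example(corpus):
    """Return (sentence_a, sentence_b, label): 50% of the time B is the
    sentence that actually follows A (IsNext), otherwise a random sentence
    from the corpus (NotNext)."""
    doc = random.choice([d for d in corpus if len(d) >= 2])
    idx = random.randrange(len(doc) - 1)
    sent_a = doc[idx]
    if random.random() < 0.5:
        return sent_a, doc[idx + 1], "IsNext"
    other_doc = random.choice(corpus)
    return sent_a, random.choice(other_doc), "NotNext"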
^5 The final model achieves 97%-98% accuracy on NSP.
^6 The vector C is not a meaningful sentence representation
without fine-tuning, since it was trained with NSP.