Pre-training via Language Modelling
BERT [9] (Bidirectional Encoder Representations from Transformers) is an extension of the Transformer architecture and comes with a specific semi-supervised training regimen: BERT heavily relies on pre-training, a form of unsupervised learning, before being fine-tuned on a downstream task in a classical supervised fashion.
During pre-training, BERT is trained on large amounts of unlabeled data via Masked Language Modelling (MLM). MLM is a prediction task in which some of the input tokens are randomly replaced by blanks (“masked”) and the model is trained to predict the tokens behind these blanks, taking into account the textual context on both sides of the blank (see the BERT paper for more details on the pre-training itself [9]). Intuitively, this general task is supposed to initialize the weights to a state in which general concepts and relationships useful for a large number of downstream tasks are already present: BERT learns a representation of the tokens. Unlike word embeddings [59], these are contextual representations: they depend both on the token and on its surrounding tokens.
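As an illustration, the sketch below shows the MLM objective at inference time, using the publicly available HuggingFace transformers library and the roberta-base checkpoint (both chosen here purely for illustration, not as a detail of the original work): the model predicts the token hidden behind the mask from its bidirectional context.

    # Minimal MLM illustration (assumes the HuggingFace transformers package
    # is installed; roberta-base is an arbitrary public checkpoint).
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="roberta-base")
    # The model predicts the masked token from the context on both sides.
    for pred in fill_mask("The method returns the <mask> of the list.")[:3]:
        print(pred["token_str"], round(pred["score"], 3))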
Of note, earlier work also used Language Modelling as a pre-training task with LSTMs (ELMo and ULMFit [7], [8]), and these approaches were applied with varying degrees of success in Software Engineering [13], [14]. BERT’s pre-training is more efficient for two reasons: BERT’s bidirectional architecture uses the context both before and after the token, whereas LSTMs use only the context before the token; and BERT uses Byte-Pair Encoding (BPE) [60] to tokenise text into subwords rather than entire words, leading to better modelling of the vocabulary (see previous work by Karampatsis et al. for an extended discussion of this aspect for source code [42]).
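To illustrate subword tokenisation, the small sketch below (again relying on the transformers library, an assumption of this example) splits a rare identifier into known subwords instead of mapping it to a single out-of-vocabulary token.

    # Subword tokenisation sketch: a rare identifier is split into several
    # known subwords rather than becoming an unknown token.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    print(tokenizer.tokenize("getElementsByTagName"))
    # e.g. ['get', 'Elements', 'By', 'Tag', 'Name'] (the exact split depends
    # on the learned BPE merges)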
RoBERTa [10] is a refinement of BERT, in particular relat-
ing to its pre-training regimen (e.g., RoBERTa uses a larger
pre-training corpus, dynamic masking, and a variation of
the pre-training task) and with only minor architectural
changes (RoBERTa uses Byte-level BPE tokenization, rather
than character-level BPE).
Fine-tuning
Both BERT and RoBERTa are hardly ever trained from scratch. Instead, starting from a pre-trained model with pre-initialized weights, the model weights are further fine-tuned by training on labeled data for a specific downstream task. This involves replacing the last layer of the model (used for the pre-training task) with a task-specific layer and resuming training. The model can leverage the pre-trained representations to learn the downstream task effectively even with a limited amount of data, allowing BERT and RoBERTa to set the state of the art on NLP benchmarks, including tasks with limited data (the GLUE benchmark [11] includes several tasks with fewer than 10,000 examples).
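A minimal sketch of this fine-tuning step, assuming the HuggingFace transformers API and tokenised, labeled datasets train_ds and eval_ds (hypothetical names), is shown below: the pre-training head is dropped and a freshly initialised classification layer is trained together with the rest of the network.

    # Fine-tuning sketch: load pre-trained weights, replace the pre-training
    # head with a task-specific classification layer, and resume training.
    # train_ds and eval_ds are assumed to be tokenised, padded, labeled datasets.
    from transformers import (AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-base", num_labels=2)  # new, randomly initialised last layer

    args = TrainingArguments(output_dir="finetuned", num_train_epochs=3,
                             per_device_train_batch_size=16,
                             learning_rate=2e-5)
    Trainer(model=model, args=args,
            train_dataset=train_ds, eval_dataset=eval_ds).train()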
Impact of the Pre-training Corpora
The standard BERT and RoBERTa models have both been
pre-trained on a large English natural language corpus, with
several models available in various sizes. There exist pre-
trained BERT models for many other natural languages and
even programming languages [61].

    EN      Leppie, that’s great news! I look forward to trying IronScheme!
    EN→DE   Leppie, das sind großartige Neuigkeiten! Ich freue mich darauf, IronScheme auszuprobieren!
    DE→EN   leppie, those are great news! I am looking forward to try out IronScheme!
    EN→FR   Leppie, c’est une excellente nouvelle! J’ai hâte d’essayer IronScheme!
    FR→EN   leppie, this is great news! I can’t wait to try IronScheme!

Fig. 1. Example of back-translation. The original English sentence is first translated to German and French, then translated back into English; resulting variation underlined. Google Translate was used for the translation.

Intuitively, one would expect a generic pre-training corpus to be a “jack of all trades, master of none”, and a more specific pre-training corpus to be better suited to a more specific domain (such as software engineering). There is evidence of this for word
embeddings in Software Engineering [62], but how much of
an impact a domain-specific pre-training corpus has for a
BERT or RoBERTa model is still an open question, which we
investigate. Of note, the ULMFit approach [8] continues the pre-training task on the task-specific data (without using labels) before the actual fine-tuning, and finds that this does improve performance.
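A sketch of such continued, domain-specific pre-training (again under the assumption of the HuggingFace transformers API, with domain_ds standing in for an unlabeled, tokenised domain corpus) could look as follows.

    # Continued pre-training sketch: run the MLM objective on unlabeled,
    # task-specific text before any fine-tuning (ULMFit-style adaptation).
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForMaskedLM.from_pretrained("roberta-base")
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

    Trainer(model=model,
            args=TrainingArguments(output_dir="domain-mlm", num_train_epochs=1),
            train_dataset=domain_ds,   # unlabeled domain text (assumed name)
            data_collator=collator).train()
    model.save_pretrained("domain-mlm")  # starting point for later fine-tuning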
3.2 Additional Techniques
Intermediate-Task Fine-Tuning
Intermediate-task fine-tuning (ITT), also known as two-stage fine-tuning, STILTs [63], or TANDA [64], is a technique whereby the model is fine-tuned twice (with labeled data): first on an intermediate task, a task different from but closely related to the target task, and finally on the actual target task (e.g., training for sentiment analysis on movies before switching to sentiment analysis on books). This is particularly attractive when little data is available for the target task whilst large amounts of data are available for a similar, possibly slightly simpler, but different intermediate task. The idea is that the target task might benefit from “knowledge” that the model acquired during intermediate-task training. Pruksachatkun et al. [65] present a survey on when this method offers good prospects in NLP.
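A sketch of ITT under the same illustrative assumptions as above (HuggingFace transformers; intermediate_ds and target_ds are hypothetical labeled, tokenised datasets):

    # Intermediate-task fine-tuning sketch: fine-tune on a large related task
    # first, then fine-tune the resulting weights on the small target task.
    from transformers import (AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    # Stage 1: intermediate task (e.g., a large generic sentiment corpus).
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-base", num_labels=2)
    Trainer(model=model, args=TrainingArguments(output_dir="stage1"),
            train_dataset=intermediate_ds).train()
    model.save_pretrained("stage1")

    # Stage 2: target task, starting from the intermediately fine-tuned model.
    model = AutoModelForSequenceClassification.from_pretrained(
        "stage1", num_labels=2)
    Trainer(model=model, args=TrainingArguments(output_dir="stage2"),
            train_dataset=target_ds).train()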
Self-Training
Self-training (also known as self-labelling or self-learning) [66], [67] is a very simple semi-supervised learning method. It can be explained as follows: a model is first trained on a (possibly too small) labeled dataset. Next, this model is used to make predictions on a number of additional unlabeled samples. The model’s predictions for these unlabeled samples are then simply used as their gold labels. We now have additional, albeit noisier, labeled data; after adding it to the original dataset, we retrain the model. Predictions can be filtered by confidence to reduce the probability of introducing noise into the training set.
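The procedure can be sketched as follows (a generic illustration: model is any classifier exposing a scikit-learn-style predict_proba, and X_l, y_l, X_u are hypothetical names for the labeled set and the unlabeled pool):

    # Self-training sketch with confidence filtering.
    import numpy as np

    model.fit(X_l, y_l)                    # 1. train on the small labeled set
    proba = model.predict_proba(X_u)       # 2. predict the unlabeled pool
    confident = proba.max(axis=1) >= 0.9   # 3. keep confident predictions only
    pseudo_labels = proba.argmax(axis=1)[confident]
    X_aug = np.concatenate([X_l, X_u[confident]])  # 4. add pseudo-labeled data
    y_aug = np.concatenate([y_l, pseudo_labels])
    model.fit(X_aug, y_aug)                # 5. retrain on the augmented set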
Data Augmentation and Back-Translation
Data augmentation is a well-known technique to increase
the amount of labeled data without any human labeling