pre-trained language or machine translation model as auxiliary features while training a supervised
model on the target task. This involves a substantial number of new parameters for each separate
target task, whereas we require minimal changes to our model architecture during transfer.
Auxiliary training objectives
Adding auxiliary unsupervised training objectives is an alternative form of semi-supervised learning. Early work by Collobert and Weston [10] used a wide variety of auxiliary NLP tasks such as POS tagging, chunking, named entity recognition, and language modeling to improve semantic role labeling. More recently, Rei [50] added an auxiliary language modeling objective to their target task objective and demonstrated performance gains on sequence labeling tasks. Our experiments also use an auxiliary objective, but as we show, unsupervised pre-training already learns several linguistic aspects relevant to target tasks.
3 Framework
Our training procedure consists of two stages. The first stage is learning a high-capacity language
model on a large corpus of text. This is followed by a fine-tuning stage, where we adapt the model to
a discriminative task with labeled data.
3.1 Unsupervised pre-training
Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1, \ldots, u_n\}$, we use a standard language modeling objective to maximize the following likelihood:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta) \qquad (1)$$

where $k$ is the size of the context window, and the conditional probability $P$ is modeled using a neural network with parameters $\Theta$. These parameters are trained using stochastic gradient descent [51].
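As an illustration, Eq. 1 can be written as a loss for minimization. The following is a minimal PyTorch sketch, not the paper's implementation: `model` is a hypothetical stand-in that maps a window of $k$ token ids to next-token logits, and in practice all positions are scored in parallel rather than in a Python loop.

```python
import torch
import torch.nn.functional as F

def lm_loss(model, tokens, k):
    """Negated L1 from Eq. 1: sum_i log P(u_i | u_{i-k}, ..., u_{i-1}; Theta),
    returned as a loss so that minimizing it with SGD maximizes the likelihood.
    `tokens` is a 1-D LongTensor of token ids."""
    total_log_prob = torch.tensor(0.0)
    for i in range(k, tokens.size(0)):
        context = tokens[i - k:i].unsqueeze(0)   # (1, k) context window
        logits = model(context)                  # (1, vocab) next-token scores (assumed API)
        log_probs = F.log_softmax(logits, dim=-1)
        total_log_prob = total_log_prob + log_probs[0, tokens[i]]
    return -total_log_prob
```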
In our experiments, we use a multi-layer Transformer decoder [34] for the language model, which is a variant of the transformer [62]. This model applies a multi-headed self-attention operation over the input context tokens, followed by position-wise feedforward layers, to produce an output distribution over target tokens:

$$h_0 = U W_e + W_p$$
$$h_l = \text{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n]$$
$$P(u) = \text{softmax}(h_n W_e^T) \qquad (2)$$

where $U = (u_{-k}, \ldots, u_{-1})$ is the context vector of tokens, $n$ is the number of layers, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix.
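A compact PyTorch sketch of Eq. 2 follows. It is illustrative rather than the exact architecture: details such as activation functions, dropout, and layer sizes are omitted, and a causally masked `nn.TransformerEncoderLayer` stands in for the decoder's transformer_block. Note that the output projection is tied to $W_e$, as in Eq. 2.

```python
import torch
import torch.nn as nn

class TransformerLM(nn.Module):
    def __init__(self, vocab_size, d_model=768, n_layers=12, n_heads=12, k=512):
        super().__init__()
        self.W_e = nn.Embedding(vocab_size, d_model)  # token embedding matrix W_e
        self.W_p = nn.Embedding(k, d_model)           # position embedding matrix W_p
        # Causally masked self-attention blocks stand in for transformer_block.
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        ])

    def forward(self, U):                             # U: (batch, seq) token ids
        seq_len = U.size(1)
        pos = torch.arange(seq_len, device=U.device)
        h = self.W_e(U) + self.W_p(pos)               # h_0 = U W_e + W_p
        causal = torch.triu(                          # forbid attention to future tokens
            torch.full((seq_len, seq_len), float('-inf'), device=U.device),
            diagonal=1)
        for block in self.blocks:                     # h_l = transformer_block(h_{l-1})
            h = block(h, src_mask=causal)
        return torch.softmax(h @ self.W_e.weight.T, dim=-1)  # P(u) = softmax(h_n W_e^T)
```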
3.2 Supervised fine-tuning
After training the model with the objective in Eq. 1, we adapt the parameters to the supervised target task. We assume a labeled dataset $\mathcal{C}$, where each instance consists of a sequence of input tokens, $x^1, \ldots, x^m$, along with a label $y$. The inputs are passed through our pre-trained model to obtain the final transformer block's activation $h_l^m$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$:

$$P(y \mid x^1, \ldots, x^m) = \text{softmax}(h_l^m W_y). \qquad (3)$$
This gives us the following objective to maximize:

$$L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x^1, \ldots, x^m). \qquad (4)$$
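A sketch of Eqs. 3 and 4 in the same vein: the wrapper below (hypothetical names; the pre-trained model is assumed here to expose its final block's activations with shape (batch, seq, d_model), which the `TransformerLM` sketch above would need a hook for) adds the single linear layer $W_y$ and trains with cross-entropy, which combines the softmax of Eq. 3 with the negated log-likelihood of Eq. 4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineTuneClassifier(nn.Module):
    def __init__(self, pretrained, d_model, n_classes):
        super().__init__()
        self.pretrained = pretrained   # all parameters are fine-tuned end to end
        self.W_y = nn.Linear(d_model, n_classes, bias=False)  # only new weight matrix

    def forward(self, x):              # x: (batch, m) token ids
        h = self.pretrained(x)         # (batch, m, d_model) final-block activations
        return self.W_y(h[:, -1, :])   # logits from h_l^m, the last token's state

def supervised_loss(clf, x, y):
    # Negated L2(C) for one batch: cross-entropy = softmax (Eq. 3) + NLL (Eq. 4).
    return F.cross_entropy(clf(x), y)
```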
We additionally found that including language modeling as an auxiliary objective during fine-tuning helped learning by (a) improving generalization of the supervised model, and (b) accelerating convergence. This is in line with prior work [50, 43], which also observed improved performance with such an auxiliary objective. Specifically, we optimize the following objective (with weight $\lambda$):

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C}) \qquad (5)$$
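Put together, a fine-tuning step on Eq. 5 is a one-liner over the two losses. The sketch below reuses `supervised_loss` from the previous fragment and assumes a hypothetical `lm_loss_fn` that computes the negated $L_1$ objective on the same labeled inputs; the default value of $\lambda$ is illustrative, not prescribed here.

```python
def fine_tune_step(clf, lm_loss_fn, optimizer, x, y, lam=0.5):
    """One optimizer step on the loss form of Eq. 5:
    minimizing -(L2 + lam * L1) maximizes L3(C)."""
    loss = supervised_loss(clf, x, y) + lam * lm_loss_fn(x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```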
Overall, the only extra parameters we require during fine-tuning are $W_y$, and embeddings for delimiter tokens (described below in Section 3.3).