where the sentence representation capability is not
effectively developed (Chang et al., 2020). Thus, it
may call for a great deal of labeled data (Nguyen
et al., 2016; Kwiatkowski et al., 2019) and sophis-
ticated fine-tuning methods (Xiong et al., 2020;
Qu et al., 2020) to ensure the pre-trained models’
performance for dense retrieval.
To mitigate the above problem, recent works propose retrieval-oriented pre-trained models. The existing methods can be divided into those based on self-contrastive learning (SCL) and those based on auto-encoding (AE). The SCL-based
methods (Chang et al., 2020; Guu et al., 2020;
Xu et al., 2022) rely on data augmentation, e.g.,
inverse cloze task (ICT), where positive samples
are generated for each anchor sentence. The language model is then trained to discriminate the positive samples from the negative ones via contrastive learning. However, self-contrastive learning
usually calls for huge amounts of negative sam-
ples, which is computationally expensive. Besides,
the pre-training effect can be severely restricted by
the quality of data augmentation. The AE-based methods are free from these restrictions: the language models are trained to reconstruct the input sentence based on the sentence embedding.
The existing methods utilize various reconstruction
tasks, such as MLM (Gao and Callan, 2021) and
auto-regression (Lu et al., 2021; Wang et al., 2021;
Li et al., 2020), which are highly differentiated in
terms of how the original sentence is recovered and
how the training loss is formulated. For example,
auto-regression relies on the sentence embedding and the prefix for reconstruction, whereas MLM utilizes the sentence embedding and the masked context. Auto-regression derives its training loss from all of the input tokens; however, conventional MLM only learns from the masked positions, which account for 15% of the input tokens. Ideally,
we expect the decoding operation to be demanding enough, as it will force the encoder to fully capture the semantics of the input so as to ensure the reconstruction quality. Besides, we also pursue high data efficiency, which means the input data can be fully utilized by the pre-training task.
3 Methodology
We develop a novel masked auto-encoder for retrieval-oriented pre-training. The model contains two modules: a BERT-like encoder $\Phi_{enc}(\cdot)$ to generate the sentence embedding, and a one-layer transformer based decoder $\Phi_{dec}(\cdot)$ for sentence reconstruction. The original sentence $X$ is masked as $\tilde{X}_{enc}$ and encoded as the sentence embedding $h_{\tilde{X}}$. The sentence is masked again (with a different mask) as $\tilde{X}_{dec}$; together with $h_{\tilde{X}}$, the original sentence $X$ is reconstructed. Detailed elaborations about RetroMAE are made as follows.
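To make the two-view masking concrete, the following is a minimal PyTorch sketch of how the same sentence can be masked twice with different ratios; the function name mask_tokens, the mask id, and the toy tensors are illustrative assumptions, not part of the paper's released code.

```python
import torch

def mask_tokens(token_ids: torch.Tensor, mask_ratio: float, mask_id: int) -> torch.Tensor:
    """Return a copy of token_ids where a random fraction of positions
    is replaced by the special [M] token id."""
    masked = token_ids.clone()
    replace = torch.rand(token_ids.shape) < mask_ratio   # Bernoulli selection of positions
    masked[replace] = mask_id
    return masked

# Toy example: two independently masked views of the same sentence X.
MASK_ID = 103                                  # illustrative id for [M]
x = torch.randint(1000, 30000, (64,))          # toy token ids of the original sentence X
x_enc = mask_tokens(x, mask_ratio=0.3, mask_id=MASK_ID)   # lightly masked view for the encoder
x_dec = mask_tokens(x, mask_ratio=0.5, mask_id=MASK_ID)   # heavily masked view for the decoder
```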
3.1 Encoding
The input sentence $X$ is polluted as $\tilde{X}_{enc}$ for the encoding stage, where a small fraction of its tokens are randomly replaced by the special token [M] (Figure 2, A). We apply a moderate masking ratio (15∼30%), which means the majority of the information about the input will be preserved. Then, the encoder $\Phi_{enc}(\cdot)$ is used to transform the polluted input into the sentence embedding $h_{\tilde{X}}$:

$$h_{\tilde{X}} \leftarrow \Phi_{enc}(\tilde{X}_{enc}). \tag{1}$$
We apply a BERT-like encoder with 12 layers and a hidden dimension of 768, which helps to capture the in-depth semantics of the sentence. Following common practice, we select the [CLS] token's final
hidden state as the sentence embedding.
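As an illustration of the encoding step, the sketch below uses the Hugging Face transformers library with a standard bert-base checkpoint (12 layers, hidden size 768) and takes the final hidden state of [CLS] as the sentence embedding; the [M] masking of the input is omitted here only to keep the example short.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")   # BERT-like: 12 layers, hidden size 768

sentence = "RetroMAE pre-trains the encoder with a masked auto-encoding workflow."
inputs = tokenizer(sentence, return_tensors="pt")
# In RetroMAE, 15~30% of these tokens would first be replaced by [M];
# the clean input is encoded here for brevity.

with torch.no_grad():
    outputs = encoder(**inputs)

# Sentence embedding: final hidden state of the [CLS] token (position 0).
h = outputs.last_hidden_state[:, 0]            # shape: (1, 768)
```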
3.2 Decoding
The input sentence $X$ is polluted as $\tilde{X}_{dec}$ for the decoding stage (Figure 2, B). The masking ratio is more aggressive than the one used by the encoder, where 50∼70% of the input tokens will be masked. The masked input is joined with the sentence embedding, based on which the original sentence is reconstructed by the decoder. Particularly, the sentence embedding and the masked input are combined into the following sequence:

$$H_{\tilde{X}_{dec}} \leftarrow [h_{\tilde{X}},\ e_{x_1} + p_1,\ ...,\ e_{x_N} + p_N]. \tag{2}$$
In the above equation, $e_{x_i}$ denotes the embedding of $x_i$, to which an extra position embedding $p_i$ is added. Finally, the decoder $\Phi_{dec}$ is trained to reconstruct the original sentence $X$ by optimizing the following objective:

$$\mathcal{L}_{dec} = \sum_{x_i \in \text{masked}} \mathrm{CE}\big(x_i \mid \Phi_{dec}(H_{\tilde{X}_{dec}})\big), \tag{3}$$

where $\mathrm{CE}$ is the cross-entropy loss. As men-
tioned, we use a one-layer transformer based de-
coder. Given the aggressively masked input and
the extremely simplified network, the decoding be-
comes challenging, which forces the generation of
high-quality sentence embedding so that the origi-
nal input can be recovered with good fidelity.
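A minimal PyTorch sketch of Eq. (2) and Eq. (3) is given below, assuming toy dimensions and using nn.TransformerEncoderLayer as a stand-in for the one-layer transformer decoder; the names decode_loss and mask_positions are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden, max_len = 30522, 768, 64              # toy sizes; hidden matches the encoder
tok_emb = nn.Embedding(vocab_size, hidden)                # e_x: token embeddings
pos_emb = nn.Embedding(max_len, hidden)                   # p: extra position embeddings
decoder = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
lm_head = nn.Linear(hidden, vocab_size)                   # projects hidden states to the vocabulary

def decode_loss(h, x_dec, x, mask_positions):
    """h: (B, hidden) sentence embedding; x_dec: (B, L) heavily masked ids;
    x: (B, L) original ids; mask_positions: (B, L) bool, True where [M] was placed."""
    B, L = x_dec.shape
    positions = torch.arange(L, device=x_dec.device)
    # Eq. (2): H = [h, e_{x_1}+p_1, ..., e_{x_N}+p_N]
    H = torch.cat([h.unsqueeze(1), tok_emb(x_dec) + pos_emb(positions)], dim=1)
    out = decoder(H)                                       # one-layer transformer reconstruction
    logits = lm_head(out[:, 1:])                           # drop the slot holding the embedding
    # Eq. (3): cross-entropy over the masked positions only
    return F.cross_entropy(logits[mask_positions], x[mask_positions])
```

The decoder here attends bidirectionally over the heavily masked sequence together with the sentence embedding, matching the MLM-style reconstruction described above.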