2 The Transformers catalog
Note:
For all the models available in Huggingface, I decided to link directly to the page in their documentation, since they do a fantastic job of offering a consistent format and links to everything else you might need, including the original papers. Only a few of the models are not included in Huggingface; for those, I try to include a link to their GitHub repository if available, or to a blog post if not. For all models, I also include a bibliographic reference.
2.1 Features of a Transformer
So hopefully by now you understand what Transformer models are and why they are so popular and impactful. In this section I will introduce a catalog of the most important Transformer models that have been developed to date. I will categorize each model according to the following properties: Pretraining Architecture, Pretraining Task, Compression, Application, Year, and Number of Parameters. Let's briefly define each of them:
2.1.1 Pretraining Architecture
We described the Transformer architecture as being made up of an Encoder and a Decoder, and that is true for the original Transformer. However, advances made since then have revealed that in some cases it is beneficial to use only the encoder, only the decoder, or both.
Encoder Pretraining
These models, which are also called bi-directional or auto-encoding, only use the encoder during pretraining, which is usually accomplished by masking words in the input sentence and training the model to reconstruct them. At each stage during pretraining, attention layers can access all the input words. This family of models is most useful for tasks that require understanding complete sentences, such as sentence classification or extractive question answering.
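To make this concrete, here is a minimal sketch of a pretrained encoder model reconstructing a masked word through the Hugging Face pipeline API; the bert-base-uncased checkpoint and the example sentence are illustrative choices, not the only options.

```python
# Sketch: a pretrained encoder (BERT) filling in a masked token.
# The checkpoint and input sentence are illustrative assumptions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("Paris is the [MASK] of France."):
    # Each prediction carries a candidate token and its probability.
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```

Because the encoder attends to every position, the prediction for [MASK] is conditioned on the words both before and after it.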
Decoder Pretraining
Decoder models, often called auto-regressive, use only the decoder during a pretraining that is usually designed so that the model is forced to predict the next word. The attention layers can only access the words positioned before a given word in the sentence. These models are best suited for tasks involving text generation.
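The same idea in code: a minimal sketch of a decoder model generating text left to right, one next-word prediction at a time; the gpt2 checkpoint, the prompt, and the sampling settings are illustrative assumptions.

```python
# Sketch: a pretrained decoder (GPT-2) continuing a prompt token by token.
# Checkpoint, prompt, and sampling parameters are illustrative assumptions.
from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")
out = generate("Decoder models predict the next word, so they",
               max_new_tokens=30, do_sample=True, top_p=0.9)
print(out[0]["generated_text"])
```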
Transformer (Encoder-Decoder) Pretraining
Encoder-decoder models, also called sequence-to-sequence, use both parts of the Transformer architecture. Attention layers of the encoder can access all the words in the input, while those of the decoder can only access the words positioned before a given word in the target sequence. The pretraining can be done using the objectives of encoder or decoder models, but usually involves something a bit more complex. These models are best suited for tasks revolving around generating new sentences conditioned on a given input, such as summarization, translation, or generative question answering.
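For illustration, here is a minimal sketch of an encoder-decoder model applied to summarization; the t5-small checkpoint, the input text, and the length limits are illustrative assumptions.

```python
# Sketch: a pretrained encoder-decoder (T5) summarizing a passage.
# Checkpoint, input text, and length limits are illustrative assumptions.
from transformers import pipeline

summarize = pipeline("summarization", model="t5-small")
article = ("Transformers were introduced in 2017 and quickly became the "
           "dominant architecture in natural language processing, powering "
           "models for translation, summarization, and question answering.")
# The encoder reads the whole article; the decoder generates the summary
# left to right, attending back to the encoder's output at every step.
print(summarize(article, max_length=25, min_length=5)[0]["summary_text"])
```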
2.1.2 Pretraining Task
When training a model we need to define a task for the model to learn from. Some of the typical tasks, such as predicting the next word or learning to reconstruct masked words, were already mentioned above. "Pre-trained Models for Natural Language Processing: A Survey" [10] includes a fairly comprehensive taxonomy of pretraining tasks, all of which can be considered self-supervised:
1. Language Modeling (LM): predict the next token (in the case of unidirectional LM) or the previous and next tokens (in the case of bidirectional LM)
2. Masked Language Modeling (MLM): mask out some tokens from the input sentences and then train the model to predict the masked tokens from the rest of the tokens (see the sketch after this list)
3. Permuted Language Modeling (PLM): the same as LM, but on a random permutation of the input sequence. A permutation is randomly sampled from all possible permutations; some of the tokens are then chosen as the target, and the model is trained to predict them.
4. Denoising Autoencoder (DAE): take a partially corrupted input (e.g., randomly sampling tokens from the input and replacing them with [MASK] elements, randomly deleting tokens from the input, or shuffling sentences in random order) and aim to recover the original, undistorted input.
5. Contrastive Learning (CTL): a score function for text pairs is learned under the assumption that some observed pairs of text are more semantically similar than randomly sampled text. It includes:
• Deep InfoMax (DIM): maximize mutual information between an image representation and local regions of the image;
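As promised in item 2, here is a minimal sketch of how MLM training targets can be constructed, following the 80/10/10 corruption recipe from the original BERT paper; the token ids, vocabulary size, and special-token id are toy assumptions.

```python
# Sketch: building (corrupted_input, labels) pairs for MLM pretraining.
# Vocabulary size and the [MASK] token id are toy assumptions.
import torch

def mask_for_mlm(input_ids: torch.Tensor,
                 vocab_size: int,
                 mask_token_id: int,
                 mask_prob: float = 0.15):
    """Return (corrupted_ids, labels) following the 80/10/10 BERT recipe."""
    labels = input_ids.clone()
    # Choose ~15% of positions as prediction targets.
    target = torch.rand(input_ids.shape) < mask_prob
    labels[~target] = -100  # convention: ignored by cross-entropy loss

    corrupted = input_ids.clone()
    # 80% of target positions become [MASK].
    replace_mask = target & (torch.rand(input_ids.shape) < 0.8)
    corrupted[replace_mask] = mask_token_id
    # 10% become a random token; the remaining 10% are kept unchanged.
    replace_rand = target & ~replace_mask & (torch.rand(input_ids.shape) < 0.5)
    corrupted[replace_rand] = torch.randint(vocab_size,
                                            (int(replace_rand.sum()),))
    return corrupted, labels

ids = torch.randint(5, 1000, (1, 12))  # a pretend-tokenized sentence
x, y = mask_for_mlm(ids, vocab_size=1000, mask_token_id=4)
print("corrupted:", x)
print("labels:   ", y)
```

The model then receives the corrupted ids and is trained to predict the original tokens at the target positions, using the rest of the sentence as context.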