All these models can be considered as Artificial Intelligence (AI) systems. AI is a broad research field aimed at creating intelligent machines that act similarly to humans and animals endowed with natural intelligence. It captures the field’s long-term goal of building machines that mimic and then surpass the full spectrum of
human cognition. Machine Learning (ML) is a subfield of artificial intelligence
that employs statistical techniques to give machines the capability to ‘learn’ from
data without being given explicit instructions on what to do. This process is also
called ‘training’, whereby a ‘learning algorithm’ gradually improves the model’s
performance on a given task. Deep Learning is an area of ML in which the input is transformed layer by layer so that complex patterns in the data can be recognized. The adjective ‘deep’ refers to the large number of layers in modern ML models, which helps them learn expressive representations of the data and achieve better performance.
In contrast to computer vision, annotated training datasets for NLP applications used to be rather small, comprising only a few thousand sentences (except for machine translation). The main reason for this was the high cost of manual annotation. To avoid overfitting, i.e. over-adapting models to random fluctuations in the data, only relatively small models could be trained, which did not yield high performance.
In the last 5 years, new NLP methods have been developed based on the Transformer
introduced by Vaswani et al. [67]. They represent the meaning of each word by a vector of real numbers called an embedding. Between these embeddings various kinds
of “attentions” can be computed, which can be considered as a sort of “correlation”
between different words. In higher layers of the network, attention computations are
used to generate new embeddings that can capture subtle nuances in the meaning
of words. In particular, they can grasp different meanings of the same word that
arise from context. A key advantage of these models is that they can be trained on unannotated text, which is available in almost unlimited quantities, so that overfitting is not a problem.
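To make this more concrete, the following sketch computes a simplified form of scaled dot-product self-attention directly on a matrix of word embeddings. It is a minimal illustration under assumed names and sizes; actual Transformer layers additionally apply learned query, key, and value projections and use several attention heads in parallel.

```python
# Minimal sketch of (simplified) scaled dot-product self-attention over word
# embeddings. All names and sizes are illustrative; real Transformer layers
# additionally use learned query/key/value projections and multiple heads.
import numpy as np

def self_attention(X):
    """X: (seq_len, embed_dim) array, one embedding vector per word."""
    d = X.shape[-1]
    # "Correlation"-like attention scores between every pair of words.
    scores = X @ X.T / np.sqrt(d)                        # (seq_len, seq_len)
    # Row-wise softmax turns the scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # New contextual embeddings: weighted mixtures of all input embeddings.
    return weights @ X                                   # (seq_len, embed_dim)

# Example: a "sentence" of 4 words, each with an 8-dimensional embedding.
rng = np.random.default_rng(0)
contextual = self_attention(rng.normal(size=(4, 8)))
print(contextual.shape)  # (4, 8)
```

Each row of the result is a new contextual embedding: a weighted mixture of all input embeddings, with the weights playing the role of the “correlations” mentioned above.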
The research field is currently developing rapidly, and many approaches from earlier years are becoming obsolete. These models are
usually trained in two steps: In a first pre-training step, they are trained on a large
text corpus containing billions of words without any annotations. A typical pre-
training task is to predict single words in the text that have been masked in the
input. In this way, the model learns fine subtleties of natural language syntax and
semantics. Because enough data is available, the models can be extended to many
layers with millions or billions of parameters.
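The masked-word objective can be illustrated with a publicly available pre-trained model. The following sketch assumes the Hugging Face Transformers library and the model bert-base-uncased, neither of which is prescribed by the text; the model is asked to predict a word that has been masked in the input.

```python
# Illustration of the masked-word prediction task with the Hugging Face
# Transformers library (an assumed tooling choice; the model name
# "bert-base-uncased" is just one publicly available pre-trained model).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model has to infer the word hidden behind [MASK] from its context,
# which is exactly the pre-training objective described above.
for candidate in fill_mask("The capital of France is [MASK]."):
    print(f"{candidate['token_str']:>10}  score={candidate['score']:.3f}")
```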
In a second fine-tuning step, the model is trained on a small annotated training
set. In this way, the model can be adapted to specific new tasks. Since the fine-tuning data is very small compared to the pre-training data and the model has a high capacity with many millions of parameters, it can be adapted to the fine-tuning task without losing the stored information about language structure.
It has been demonstrated that this idea can be applied to most NLP tasks, leading to unprecedented performance gains in semantic understanding. Through this transfer learning, knowledge acquired during pre-training is carried over to the fine-tuned model. Such models are referred to as Pre-trained Language Models (PLMs).
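A fine-tuning step of this kind might, for example, look like the following sketch, which adapts a pre-trained model to a small annotated sentiment classification set. The libraries (Hugging Face Transformers and Datasets), the model name, and the SST-2 data are illustrative assumptions; any PLM and labelled task could be substituted.

```python
# Sketch of the fine-tuning step: adapting a pre-trained model to a small
# annotated classification set. Libraries (Hugging Face Transformers/Datasets),
# model name and dataset are assumptions for illustration only.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"                  # the pre-trained model (PLM)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2)

# A small annotated training set: a 2000-sentence slice of SST-2 sentiment data.
train_data = load_dataset("glue", "sst2", split="train[:2000]")
train_data = train_data.map(lambda ex: tokenizer(ex["sentence"], truncation=True),
                            batched=True)

args = TrainingArguments(output_dir="sst2-finetuned",
                         num_train_epochs=2,
                         per_device_train_batch_size=16)

# All pre-trained weights are updated on the small labelled set, so knowledge
# acquired during pre-training is transferred to the new task.
Trainer(model=model, args=args, train_dataset=train_data,
        tokenizer=tokenizer).train()
```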