Word2Vec, as well as other similar models (e.g. GloVe [5]), is considered a context-free approach to
textual representation. This means that these models have limitations in representing
the context of words in a text, which impairs sentence-level tasks and even
fine-tuning at the word level. They are unidirectional, that is, they
consider the context of a word only from left to right, with no mechanism to detect whether a
particular word has already occurred earlier in the corpus. As a consequence, these models provide a
single representation, a dense vector, for each word in a text or set of texts.
In addition, according to their authors [4], these models are considered very shallow, since they
represent each word in a single layer, and there is a limit to the amount of information they can
capture. Moreover, these models do not handle the polysemy of words, that is, the fact that the same word
used in different contexts can have different meanings (e.g. bank in the monetary sense vs. bank
as a bench to sit on). Another characteristic that is not handled is synonymy, that is, when two or
more different words share the same meaning (e.g. create, implement, generate).
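To illustrate this limitation, the following minimal sketch (not part of the original experiments; the toy sentences, the gensim library and its parameters are arbitrary choices for illustration) shows that a static Word2Vec model stores exactly one dense vector for the word "bank", regardless of the sense in which it occurs:

# Minimal sketch of the single-vector limitation of static embeddings.
# The two toy sentences use "bank" in different senses, yet the trained
# model keeps only one vector for the word.
from gensim.models import Word2Vec

sentences = [
    ["she", "deposited", "money", "in", "the", "bank"],           # monetary sense
    ["they", "sat", "on", "the", "bank", "of", "the", "river"],   # bench/river sense
]

# Toy hyperparameters, chosen only for the example.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# A single 50-dimensional vector represents every occurrence of "bank",
# so both senses collapse into the same representation.
print(model.wv["bank"].shape)   # (50,)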
On the other hand, many advances have occurred in the NLP area in recent years, mainly due to
deep learning techniques [23]. Among these advances is the possibility of obtaining contextualized
embeddings. This approach produces different vector representations for the same word in a text,
varying according to its context. These techniques are therefore capable of capturing the contextual
semantics of ambiguous words [14], as well as addressing polysemy issues. Following this new paradigm,
recent studies have turned to contextualized embedding models [24], [14], leaving aside the original
paradigm, in which there was only a single embedding vector for each distinct word in a text or set
of texts. Thus, each occurrence of a word is mapped to a dense vector that specifically takes the
surrounding context into account.
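The following sketch illustrates this behavior with a publicly released BERT checkpoint (bert-base-uncased, accessed here through the HuggingFace transformers library; the sentences and the choice of taking the first sub-token of the target word are illustrative assumptions, not the procedure of this work). The same word "bank" receives a different vector in each sentence:

# Minimal sketch of contextualized embeddings: the vector for "bank"
# depends on the sentence in which it appears.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word):
    # Encode the sentence and return the hidden state of the target word's token.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = vector_for("she deposited money in the bank", "bank")
v2 = vector_for("they sat on the bank of the river", "bank")

# The cosine similarity is well below 1.0: each occurrence of "bank"
# is mapped to its own context-dependent vector.
print(torch.cosine_similarity(v1, v2, dim=0).item())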
This representation approach is readily applicable to many NLP tasks in which the inputs are
usually sentences and, therefore, context information is available, such as textual software
requirements. This new language representation paradigm originated from several ideas and
initiatives that have emerged in NLP in recent years, such as CoVe [25], ELMo [24], ULMFiT [21],
CVT [26], Context2Vec [10], BERT [19] and the OpenAI Transformer (GPT and GPT-2) [27]. The
contextualized pre-trained BERT model [19] has presented greatly improved results in NLP
tasks and has therefore been widely used in several applications, through the pre-trained models
made available by its authors (e.g. BERT_base and BERT_large [19]).
2.2. BERT
BERT is an innovative method, considered the state of the art in pre-trained language
representation [19]. BERT models are regarded as contextualized or dynamic models and have
shown much-improved results in several NLP tasks [22], [24], [27], [21], such as sentiment
classification, semantic textual similarity computation and textual entailment recognition.
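As an illustration of the semantic similarity use case, the sketch below derives a sentence embedding by mean-pooling BERT's token vectors and compares two requirement-like sentences with cosine similarity. The pooling strategy, the checkpoint name and the example sentences are illustrative assumptions for this sketch, not the pipeline adopted in this work:

# Illustrative sketch: semantic textual similarity from mean-pooled BERT vectors.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def sentence_embedding(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state     # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)               # mean-pooled sentence vector

a = sentence_embedding("The system shall allow users to reset their password.")
b = sentence_embedding("Users must be able to recover a forgotten password.")

# Higher cosine similarity indicates semantically closer requirements.
print(torch.cosine_similarity(a, b, dim=0).item())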
This model originated from various ideas and initiatives aimed at textual representation that
have emerged in the NLP area in recent years, such as CoVe [25], ELMo [24], ULMFiT [21],
CVT [26], Context2Vec [28], the OpenAI Transformer (GPT and GPT-2) [27] and the
Transformer [29]. BERT is characterized as a dynamic method mainly because it relies on the
attention mechanism of the Transformer [19], which allows the context of each
word in a text to be analyzed individually, including checking whether each word has previously
appeared in a text with the same context. This allows the method to learn contextual relationships
between words (or subwords) in a text. BERT consists of a stack of Transformer encoders [29] whose
parameters are pre-trained on large unlabeled corpora, such as Wikipedia and BooksCorpus [30]. It can
be said that, for a given input sentence, BERT "looks left and right several times" and outputs a dense vector