On the PTB, the model produced results on par with the
existing state of the art [62], despite having only
19 million trainable parameters, considerably fewer than
competing models. Since the network captured morphological
similarities through character-level analysis, it handled
rare words better than previous models. Analysis
showed that without the use of highway layers, many words
had nearest neighbors that were orthographically similar but
not necessarily semantically similar. In addition, the network
was capable of recognizing misspelled words and nonstandard
spellings (e.g., looooook instead of look), as well as
out-of-vocabulary words. The analysis also
showed that the network was capable of identifying prefixes,
roots, and suffixes, as well as understanding hyphenated words,
making it a robust model.
Jozefowicz et al. [63] tested a number of architectures
producing character-level outputs [55], [64]–[66]. While many
of these models had only been tested on small-scale language
modeling, this study tested them at a large scale using
the Billion Word Benchmark. The most effective model,
achieving a state-of-the-art (for single models) perplexity
of 30.0 with 1.04 billion trainable parameters (compared to
a previous best by a single model of 51.3 with 20 billion
parameters [55]), was a large LSTM using a character-level
CNN as an input network. The best performance, however,
was achieved using an ensemble of ten LSTMs. This ensemble,
with a perplexity of 23.7, far surpassed the previous state-of-
the-art ensemble [65], which had a perplexity of 41.0.
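For reference, the perplexity figures quoted here follow the standard definition over a held-out corpus of N tokens,

\[
\mathrm{PP} \;=\; \exp\!\Big(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_1,\ldots,w_{i-1})\Big),
\]

so lower values indicate that the model assigns higher probability to the test text.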
6) Development of Word Embeddings: Not only do neural
language models allow for the prediction of unseen synony-
mous words, but they also allow for modeling the relationships
between words [67], [68]. Vectors with numeric compo-
nents, representing individual words, obtained by language
modeling techniques are called embeddings. This is usually
done either by the use of principal component analysis or by
capturing internal states in a neural language model. (Note
that these are not standard language models, but rather are
language models constructed specifically for this purpose.)
Typically, word embeddings have between 50 and 300 dimen-
sions. An overused example is that of the distributed represen-
tations of the words king, queen, man, and woman. If one takes
the embedding vectors for each of these words, computation
can be performed to obtain highly sensible results. If the
vectors representing these words are, respectively, represented
as k, q, m, and w, it can be observed that

k − q ≈ m − w,

which is extremely intuitive to human reasoning. In recent
years, word embeddings have been the standard form of input
to NLP systems.
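For concreteness, the following sketch (a minimal illustration, not tied to any particular embedding package) ranks candidate words for such an analogy by cosine similarity; the randomly initialized vectors stand in for embeddings that would, in practice, come from a trained model such as word2vec or GloVe.

```python
import numpy as np

# Toy embedding table: random vectors are used here purely to show the
# mechanics of the analogy computation; real embeddings would be learned.
rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "prince", "princess"]
emb = {w: rng.normal(size=300) for w in vocab}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, embeddings):
    """Rank candidate words by similarity to vec(a) - vec(b) + vec(c),
    e.g., king - queen + woman, which should land near man."""
    target = embeddings[a] - embeddings[b] + embeddings[c]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return sorted(candidates,
                  key=lambda w: cosine(target, embeddings[w]),
                  reverse=True)

print(analogy("king", "queen", "woman", emb))
```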
7) Recent Advances and Challenges: Language model-
ing has been evolving on a weekly basis, beginning with
the works of Radford et al. [69] and Peters et al. [70].
Radford et al. [69] introduced generative pretraining (GPT),
which pretrained a language model based on the transformer
model [42] (Section IV-G), learning dependencies of words
in sentences and longer segments of text, rather than just
the immediately surrounding words. Peters et al. [70] incorpo-
rated bidirectionalism to capture backward context in addition
to the forward context, in their Embeddings from Language
Models (ELMo). In addition, they captured the vectorizations
at multiple levels, rather than just the final layer. This allowed
for multiple encodings of the same information to be captured,
which was empirically shown to significantly boost the per-
formance.
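Concretely, in Peters et al.'s formulation [70], the representations from all layers of the bidirectional language model are combined through a learned, softmax-normalized weighting rather than taking only the top layer,

\[
\mathrm{ELMo}_k \;=\; \gamma \sum_{j=0}^{L} s_j \, \mathbf{h}_{k,j},
\]

where \(\mathbf{h}_{k,j}\) is the layer-\(j\) hidden state for token \(k\) (with \(j=0\) the token embedding), the weights \(s_j\) sum to one, and \(\gamma\) is a task-specific scaling factor.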
Devlin et al. [71] added the unsupervised training
tasks of random masked word prediction and next-
sentence prediction (NSP), in which, given a sentence (or other
continuous segment of text), another sentence was predicted
to either be the next sentence or not. These Bidirectional
Encoder Representations from Transformers (BERT) were
further built upon by Liu et al. [72] to create multitask DNN
(MT-DNN) representations, which are the current state of the
art in language modeling. The model used a stochastic answer
network (SAN) [73], [74] on top of a BERT-like model. After
pretraining, the model was trained on a number of different
tasks before being fine-tuned to the task at hand. Using
MT-DNN as the language model, they achieved state-of-
the-art results on ten of the eleven attempted tasks.
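A minimal sketch of how such pretraining examples might be constructed is given below; the 15% masking rate and the [CLS]/[SEP]/[MASK] markers follow Devlin et al. [71], while the tokenization, corpus handling, and the masking recipe itself (the full procedure also replaces some selected tokens with random words or leaves them unchanged) are simplified assumptions.

```python
import random

CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly hide ~15% of the non-special tokens; the model is then
    trained to predict the original token at each masked position."""
    inputs, labels = [], []
    for tok in tokens:
        if tok not in (CLS, SEP) and random.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)      # target for masked prediction
        else:
            inputs.append(tok)
            labels.append(None)     # position not scored
    return inputs, labels

def nsp_example(sentences, idx):
    """Pair sentence idx with its true successor half the time and with
    a random sentence otherwise (ideally from a different document),
    returning the token sequence and the binary is-next label."""
    first = sentences[idx]
    if random.random() < 0.5 and idx + 1 < len(sentences):
        second, is_next = sentences[idx + 1], 1
    else:
        second, is_next = random.choice(sentences), 0
    return [CLS] + first + [SEP] + second + [SEP], is_next

corpus = [["the", "cat", "sat"], ["on", "the", "mat"], ["dogs", "bark"]]
pair, is_next = nsp_example(corpus, 0)
print(mask_tokens(pair), is_next)
```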
While these pretrained models have made excellent head-
way in “understanding” language, as is required for some tasks
such as entailment inference, it has been hypothesized by some
that these models are learning templates or syntactic patterns
present within the data sets, unrelated to logic or inference.
When new data sets are carefully constructed to remove such
patterns, the models do not perform well [75]. In addition,
while there has been recent work on cross-language and
universal language modeling, considerably more work is
needed to address low-resource languages.
B. Morphology
Morphology is concerned with finding segments within
single words, including roots and stems, prefixes, suffixes,
and—in some languages—infixes. Affixes (prefixes, suffixes,
and infixes) are used to overtly modify stems for gender,
number, person, and so on.
Luong et al. [76] constructed a morphologically aware lan-
guage model. An RvNN was used to model the morpho-
logical structure, and a neural language model was then placed
on top of the RvNN. The model was trained on the WordSim-
353 data set [77], and segmentation was performed using Mor-
fessor [78]. Two models were constructed—one using context
and one not. It was found that the model that was insensitive
to context overaccounted for certain morphological structures.
In particular, words with the same stem were clustered together
even if they were antonyms. The context-sensitive model
performed better, noting the relationships between the stems
but also accounting for other features such as the prefix “un.”
The model was also tested on several other popular data
sets [79]–[81], significantly outperforming previous embed-
ding models on all.
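To illustrate the flavor of such a recursive composition over morphemes (a simplified sketch: the segmentation, the single weight matrix, and the random initialization below are illustrative assumptions rather than Luong et al.'s exact formulation), a word vector can be built bottom-up from its stem and affixes:

```python
import numpy as np

DIM = 50
rng = np.random.default_rng(1)

# Toy morpheme embeddings and composition weights; in the actual model
# these would be learned jointly with the language model placed on top.
morpheme_emb = {m: rng.normal(scale=0.1, size=DIM) for m in ("un", "kind", "ness")}
W = rng.normal(scale=0.1, size=(DIM, 2 * DIM))
b = np.zeros(DIM)

def compose(parent, child):
    """One RvNN step: merge two representations through an affine map
    followed by a tanh nonlinearity."""
    return np.tanh(W @ np.concatenate([parent, child]) + b)

def word_vector(morphemes):
    """Fold left-to-right over a segmentation such as ['un', 'kind', 'ness']."""
    vec = morpheme_emb[morphemes[0]]
    for m in morphemes[1:]:
        vec = compose(vec, morpheme_emb[m])
    return vec

print(word_vector(["un", "kind", "ness"]).shape)  # (50,)
```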
A good morphological analyzer is often important for many
NLP tasks. As such, one recent study by Belinkov et al. [82]
examined the extent to which morphology was learned and
used by a variety of neural machine translation (NMT)
models. A number of translation models were constructed,
all translating from English to French, German, Czech,
Arabic, or Hebrew. Encoders and decoders were LSTM-based
models (some with attention mechanisms) or character-
aware CNNs, and the models were trained on the WIT3
corpus [83], [84]. The decoders were then replaced with POS
taggers and morphological taggers, fixing the weights of the
encoders to preserve the internal representations. The effects
of the encoders were examined as were the effects of the