computer vision researchers immediately realized the limitations of the knowledge-based paradigm, given the need for machine learning methods with uncertainty-handling and generalization capabilities.
The empiricism in NLP and speech recognition in this second wave was based on data-intensive machine learning, which we now call “shallow” due to the general lack of abstractions constructed by many-layer or “deep” representations of data, which would arrive in the third wave described in the next section. In machine learning, researchers do not need to concern themselves with constructing the precise and exact rules required for the knowledge-based NLP and speech systems of the first wave.
Rather, they focus on statistical models (Bishop 2006; Murphy 2012) or simple neural networks (Bishop 1995) as the underlying engine. They then automatically learn, or “tune,” the parameters of that engine using ample training data, so that it handles uncertainty and attempts to generalize from one condition to another and from one domain to another. The key algorithms and methods for such machine learning include EM (expectation-maximization), Bayesian networks, support vector machines, decision trees, and, for neural networks, the backpropagation algorithm.
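To make this concrete, the following is a minimal sketch (not drawn from this book) of how such parameter tuning works: a single-layer, “shallow” model whose weights are estimated from labeled training data by gradient descent, the simplest form of backpropagation. The toy data and all variable names are purely illustrative.

```python
# Minimal sketch: tune the parameters of a shallow (single-layer) model
# from labeled training data rather than from hand-written rules.
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: 2-dimensional feature vectors with binary labels.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)  # a simple linear concept

w = np.zeros(2)   # model parameters ("the engine") to be tuned
b = 0.0
lr = 0.1          # learning rate

for _ in range(500):
    z = X @ w + b
    p = 1.0 / (1.0 + np.exp(-z))      # predicted probabilities
    grad_z = (p - y) / len(y)         # gradient of the log-loss w.r.t. z
    w -= lr * (X.T @ grad_z)          # parameter updates driven by data
    b -= lr * grad_z.sum()

accuracy = ((X @ w + b > 0) == y.astype(bool)).mean()
print(f"training accuracy: {accuracy:.2f}")
```

The same pattern, estimating parameters from data rather than writing rules, underlies the statistical models cited above; only the model family and the estimation algorithm change.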
Generally speaking, machine-learning-based NLP, speech, and other artificial intelligence systems perform much better than their earlier, knowledge-based counterparts. Successful examples include almost all artificial intelligence tasks in machine
perception—speech recognition (Jelinek 1998), face recognition (Viola and Jones
2004), visual object recognition (Fei-Fei and Perona 2005), handwriting recognition
(Plamondon and Srihari 2000), and machine translation (Och 2003).
More specifically, in machine translation, a core NLP application area to be described in detail in Chap. 6 of this book as well as in Church and Mercer (1993), the field switched rather abruptly around 1990 from the rationalistic methods outlined in Sect. 1.2 to empirical, largely statistical methods. The availability of sentence-level
alignments in the bilingual training data made it possible to acquire surface-level
translation knowledge not by rules but from data directly, at the expense of discarding
or discounting structured information in natural languages. The most representative
work during this wave is that empowered by various versions of IBM translation
models (Brown et al. 1993). Subsequent developments during this empiricist era of
machine translation further significantly improved the quality of translation systems
(Och and Ney 2002; Och 2003; Chiang 2007; He and Deng 2012), but not to the level of massive real-world deployment (which would come after the next, deep learning wave).
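To illustrate how surface-level translation knowledge can be acquired from sentence-aligned data alone, below is a minimal sketch of the EM training loop of IBM Model 1, the simplest of the IBM translation models. The toy corpus, the omission of the NULL source word, and all variable names are illustrative simplifications rather than the models' full formulation.

```python
# Minimal sketch of IBM Model 1 EM training on a toy sentence-aligned corpus.
# Real systems of this era used millions of sentence pairs and several
# further model refinements (fertility, distortion, the NULL word, etc.).
from collections import defaultdict

# Tiny sentence-aligned bilingual corpus: (foreign sentence, English sentence).
corpus = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
    ("une maison".split(), "a house".split()),
]

f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}

# t[f][e]: translation probability P(f | e), initialized uniformly.
t = {f: {e: 1.0 / len(f_vocab) for e in e_vocab} for f in f_vocab}

for _ in range(10):                      # EM iterations
    count = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    # E-step: collect expected word-alignment counts from the data.
    for fs, es in corpus:
        for f in fs:
            norm = sum(t[f][e] for e in es)
            for e in es:
                frac = t[f][e] / norm
                count[e][f] += frac
                total[e] += frac
    # M-step: re-estimate translation probabilities from the counts.
    for e in e_vocab:
        for f in count[e]:
            t[f][e] = count[e][f] / total[e]

print(f"P(maison | house) = {t['maison']['house']:.2f}")
```

After a few iterations the probability mass concentrates on the co-occurring word pairs, which is precisely the surface-level, rule-free translation knowledge described above.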
In the dialogue and spoken language understanding areas of NLP, this empiri-
cist era was also marked prominently by data-driven machine learning approaches.
These approaches were well suited to meet the requirement for quantitative evalua-
tion and concrete deliverables. They focused on broader but shallow, surface-level
coverage of text and domains instead of detailed analyses of highly restricted text
and domains. The training data were used not to design rules for language understanding and response actions in the dialogue systems but to automatically learn the parameters of (shallow) statistical or neural models. Such learning helped reduce the cost of hand-crafting complex dialogue managers, and helped improve robustness against speech recognition errors in the overall spoken language