representations such as syntax trees have not yet gone the way of the visual edge detector or the auditory triphone. Linguists have argued for the existence of a “language faculty” in all human beings, which encodes a set of abstractions specially designed to facilitate the understanding and production of language. The argument for the existence of such a language faculty is based on the observation that children learn language faster and from fewer examples than would reasonably be possible if language were learned from experience alone.[3] Regardless of the cognitive validity of these arguments, it seems that linguistic structures are particularly important in scenarios where training data is limited.
Moving away from the extreme ends of the continuum, there are a number of ways in which knowledge and learning can be combined in natural language processing. Many supervised learning systems make use of carefully engineered features, which transform the data into a representation that can facilitate learning. For example, in a task like document classification, it may be useful to identify each word’s stem, so that a learning system can more easily generalize across related terms such as whale, whales, whalers, and whaling. This is particularly important in the many languages that exceed English in the complexity of the system of affixes that can attach to words. Such features could be obtained from a hand-crafted resource, like a dictionary that maps each word to a single root form. Alternatively, features can be obtained from the output of a general-purpose language processing system, such as a parser or part-of-speech tagger, which may itself be built on supervised machine learning.
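As a small, concrete illustration of this kind of feature engineering, the Python snippet below maps inflected forms toward a common stem before they reach a learning system. The use of NLTK’s Porter stemmer here is just one possible choice, not the only way to obtain such features; a dictionary-based lemmatizer, as described above, would be an alternative.

# Stemming as a hand-engineered feature: map related surface forms toward a
# shared stem so that a classifier can generalize across them.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["whale", "whales", "whalers", "whaling"]:
    print(word, "->", stemmer.stem(word))

# Most of these forms collapse to the same stem; forms that do not are one
# reason hand-crafted lexical resources (e.g., a dictionary of root forms)
# remain useful alongside rule-based stemmers.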
Another synthesis of learning and knowledge is in model structure: building machine learning models whose architectures are inspired by linguistic theories. For example, the organization of sentences is often described as compositional, with the meaning of larger units gradually constructed from the meaning of their smaller constituents. This idea can be built into the architecture of a deep neural network, which is then trained using contemporary deep learning techniques (Dyer et al., 2016).
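To make the idea of building compositionality into a model architecture slightly more concrete, the sketch below implements a toy recursive composition: each phrase vector is computed from the vectors of its two children by a single shared transformation. This is a deliberate simplification for illustration only; the embeddings and weights are random placeholders, the variable names are invented for this sketch, and the architecture of Dyer et al. (2016) is considerably more sophisticated.

# A toy recursive composition: the representation of a phrase is built from the
# representations of its constituents, mirroring the compositional view of
# sentence structure. Weights and embeddings are random placeholders; in a real
# system they would be trained with deep learning techniques.
import numpy as np

rng = np.random.default_rng(0)
DIM = 4  # illustrative embedding size

# hypothetical word vectors
embeddings = {w: rng.normal(size=DIM) for w in ["the", "whalers", "sailed"]}
W = rng.normal(size=(DIM, 2 * DIM))  # shared composition weights

def compose(tree):
    # a tree is either a word (string) or a pair of subtrees (tuple)
    if isinstance(tree, str):
        return embeddings[tree]
    left, right = tree
    # combine the two child representations with one learned transformation
    return np.tanh(W @ np.concatenate([compose(left), compose(right)]))

# "the whalers" forms a constituent, which then combines with "sailed"
sentence_vector = compose((("the", "whalers"), "sailed"))
print(sentence_vector.shape)  # (4,)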
The debate about the relative importance of machine learning and linguistic knowledge sometimes becomes heated. No machine learning specialist likes to be told that their engineering methodology is unscientific alchemy;[4] nor does a linguist want to hear that the search for general linguistic principles and structures has been made irrelevant by big data. Yet there is clearly room for both types of research: we need to know how far we can go with end-to-end learning alone, while at the same time, we continue the search for linguistic representations that generalize across applications, scenarios, and languages.
For more on the history of this debate, see Church (2011); for an optimistic view of the potential symbiosis between computational linguistics and deep learning, see Manning (2015).
[3] The Language Instinct (Pinker, 2003) articulates these arguments in an engaging and popular style. For arguments against the innateness of language, see Elman et al. (1998).
[4] Ali Rahimi argued that much of deep learning research was similar to “alchemy” in a presentation at the 2017 conference on Neural Information Processing Systems. He was advocating for more learning theory, not more linguistics.