A Primer on Neural Network Models
for Natural Language Processing
Draft as of October 6, 2015.
The most up-to-date version of this manuscript is available at http://www.cs.biu.ac.il/~yogo/nnlp.pdf. Major updates will be published on arXiv periodically.
I welcome any comments you may have regarding the content and presentation. If you
spot a missing reference or have relevant work you’d like to see mentioned, do let me know.
Over the past few years, neural networks have re-emerged as powerful machine-learning
models, yielding state-of-the-art results in ﬁelds such as image recognition and speech
processing. More recently, neural network models have also begun to be applied to textual
natural language signals, again with very promising results. This tutorial surveys neural
network models from the perspective of natural language processing research, in an attempt
to bring natural-language researchers up to speed with the neural techniques. The tutorial
covers input encoding for natural language tasks, feed-forward networks, convolutional
networks, recurrent networks and recursive networks, as well as the computation graph
abstraction for automatic gradient computation.
1. Introduction

For a long time, core NLP techniques were dominated by machine-learning approaches that
used linear models such as support vector machines or logistic regression, trained over very
high dimensional yet very sparse feature vectors.
Recently, the ﬁeld has seen some success in switching from such linear models over
sparse inputs to non-linear neural-network models over dense inputs. While most of the
neural network techniques are easy to apply, sometimes as almost drop-in replacements for
the old linear classifiers, there is in many cases a strong barrier to entry. In this tutorial I
attempt to provide NLP practitioners (as well as newcomers) with the basic background,
jargon, tools and methodology that will allow them to understand the principles behind
the neural network models and apply them to their own work. This tutorial is expected
to be self-contained, while presenting the diﬀerent approaches under a uniﬁed notation and
framework. It necessarily repeats much material that is available elsewhere. It also points to
external sources for more advanced topics when appropriate.
This primer is not intended as a comprehensive resource for those who will go on to
develop the next advances in neural-network machinery (though it may serve as a good entry
point). Rather, it is aimed at those readers who are interested in taking the existing, useful
technology and applying it in useful and creative ways to their favourite NLP problems. For
more in-depth, general discussion of neural networks, the theory behind them, advanced
optimization methods and other advanced topics, the reader is referred to other existing
resources. In particular, the book by Bengio et al. (2015) is highly recommended.

arXiv:1510.00726v1 [cs.CL] 2 Oct 2015
Scope The focus is on applications of neural networks to language processing tasks. How-
ever, some subareas of language processing with neural networks were deliberately left out of
the scope of this tutorial. These include the vast literature of language modeling and acoustic
modeling, the use of neural networks for machine translation, and multi-modal applications
combining language and other signals such as images and videos (e.g. caption generation).
Caching methods for eﬃcient runtime performance, methods for eﬃcient training with large
output vocabularies, and attention models are also not discussed. Word embeddings are
discussed only to the extent needed in order to use them as inputs for other models. Other
unsupervised approaches, including autoencoders and recursive autoencoders, also fall out
of scope. While some applications of neural networks for language
modeling and machine translation are mentioned in the text, their treatment is by no means exhaustive.
A Note on Terminology The word “feature” is used to refer to a concrete linguistic
input such as a word, a suffix, or a part-of-speech tag. For example, in a first-order
part-of-speech tagger, the features might be “current word, previous word, next word, previous
part of speech”. The term “input vector” is used to refer to the actual input that is fed
to the neural-network classiﬁer. Similarly, “input vector entry” refers to a speciﬁc value
of the input. This is in contrast to much of the neural networks literature, in which the
word “feature” is overloaded between the two uses, and is used primarily to refer to an
input-vector entry.
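To make the distinction concrete, here is a minimal sketch of feature extraction for such a first-order tagger (the function and feature names are illustrative, not taken from any particular system):

```python
# Minimal sketch: "features" in this tutorial's sense are concrete
# linguistic inputs; the "input vector" is what is actually fed to
# the network. All names here are made up for illustration.

def extract_features(words, tags, i):
    """Features for tagging position i: the current, previous and next
    words, plus the previously predicted part-of-speech tag."""
    return {
        "current_word": words[i],
        "prev_word": words[i - 1] if i > 0 else "<s>",
        "next_word": words[i + 1] if i < len(words) - 1 else "</s>",
        "prev_pos": tags[i - 1] if i > 0 else "<s>",
    }

feats = extract_features(["the", "dog", "barks"], ["DET", "NOUN"], 2)
print(feats["current_word"])  # barks
print(feats["prev_pos"])      # NOUN
```

Each of these feature values would then be mapped to a vector before being fed to the network, as described in Section 3.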
Mathematical Notation I use bold upper-case letters to represent matrices (X, Y,
Z), and bold lower-case letters to represent vectors (b). When there are series of related
matrices and vectors (for example, where each matrix corresponds to a different layer in
the network), superscript indices are used (W^1, W^2). For the rare cases in which we want to
indicate the power of a matrix or a vector, a pair of brackets is added around the item to
be exponentiated: (W)^2. Unless otherwise stated, vectors are assumed to be row
vectors. We use [v_1; v_2] to denote vector concatenation.
2. Neural Network Architectures
Neural networks are powerful learning models. We will discuss two kinds of neural network
architectures that can be mixed and matched: feed-forward networks and recurrent/recursive
networks. Feed-forward networks include networks with fully connected layers,
such as the multi-layer perceptron, as well as networks with convolutional and pooling
layers. All of the networks act as classiﬁers, but each with diﬀerent strengths.
Fully connected feed-forward neural networks (Section 4) are non-linear learners that
can, for the most part, be used as a drop-in replacement wherever a linear learner is used.
This includes binary and multiclass classiﬁcation problems, as well as more complex struc-
tured prediction problems (Section 8). The non-linearity of the network, as well as the
ability to easily integrate pre-trained word embeddings, often lead to superior classiﬁcation
accuracy. A series of works (Chen & Manning, 2014; Weiss, Alberti, Collins, & Petrov,
2015; Pei, Ge, & Chang, 2015; Durrett & Klein, 2015) managed to obtain improved syntac-
tic parsing results by simply replacing the linear model of a parser with a fully connected
feed-forward network. Straight-forward applications of a feed-forward network as a classi-
ﬁer replacement (usually coupled with the use of pre-trained word vectors) provide beneﬁts
also for CCG supertagging (Lewis & Steedman, 2014), dialog state tracking (Henderson,
Thomson, & Young, 2013), pre-ordering for statistical machine translation (de Gispert,
Iglesias, & Byrne, 2015) and language modeling (Bengio, Ducharme, Vincent, & Janvin,
2003; Vaswani, Zhao, Fossum, & Chiang, 2013). Iyyer et al. (2015) demonstrate that multi-layer
feed-forward networks can provide competitive results on sentiment classification and
factoid question answering.
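The "drop-in replacement" view can be made concrete: where a linear model computes xW + b, a one-hidden-layer feed-forward network computes g(xW^1 + b^1)W^2 + b^2 for some non-linearity g. The sketch below (all weights and dimensions are made-up toy values) contrasts the two:

```python
import math

def linear_scores(x, W, b):
    """Linear classifier: scores = x*W + b (x is a row vector)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def mlp_scores(x, W1, b1, W2, b2):
    """One-hidden-layer feed-forward net: g(x*W1 + b1)*W2 + b2,
    with g = tanh applied element-wise. Same input and output
    shapes as the linear classifier it replaces."""
    h = [math.tanh(z) for z in linear_scores(x, W1, b1)]
    return linear_scores(h, W2, b2)

# Toy parameters: input dim 2, hidden dim 2, two output classes.
x = [0.5, -1.0]
W1 = [[0.1, 0.4], [0.2, -0.3]]; b1 = [0.0, 0.1]
W2 = [[1.0, -1.0], [0.5, 0.5]]; b2 = [0.0, 0.0]
print(mlp_scores(x, W1, b1, W2, b2))
```

From a caller's perspective the two scorers are interchangeable, which is why the works cited above could swap one in for the other.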
Networks with convolutional and pooling layers (Section 9) are useful for classiﬁcation
tasks in which we expect to ﬁnd strong local clues regarding class membership, but these
clues can appear in diﬀerent places in the input. For example, in a document classiﬁcation
task, a single key phrase (or an ngram) can help in determining the topic of the document
(Johnson & Zhang, 2015). We would like to learn that certain sequences of words are good
indicators of the topic, and do not necessarily care where they appear in the document.
Convolutional and pooling layers allow the model to learn to ﬁnd such local indicators,
regardless of their position. Convolutional and pooling architectures show promising results
on many tasks, including document classiﬁcation (Johnson & Zhang, 2015), short-text cat-
egorization (Wang, Xu, Xu, Liu, Zhang, Wang, & Hao, 2015a), sentiment classiﬁcation
(Kalchbrenner, Grefenstette, & Blunsom, 2014; Kim, 2014), relation type classiﬁcation be-
tween entities (Zeng, Liu, Lai, Zhou, & Zhao, 2014; dos Santos, Xiang, & Zhou, 2015), event
detection (Chen, Xu, Liu, Zeng, & Zhao, 2015; Nguyen & Grishman, 2015), paraphrase
identification (Yin & Schütze, 2015), semantic role labeling (Collobert, Weston, Bottou, Karlen,
Kavukcuoglu, & Kuksa, 2011), question answering (Dong, Wei, Zhou, & Xu, 2015), predicting
box-office revenues of movies based on critic reviews (Bitvai & Cohn, 2015), modeling
text interestingness (Gao, Pantel, Gamon, He, & Deng, 2014), and modeling the relation
between character-sequences and part-of-speech tags (Santos & Zadrozny, 2014).
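The position-invariance idea can be illustrated with a toy one-dimensional convolution followed by max pooling (the window size, word vectors and filter weights below are all illustrative, hand-picked values):

```python
import math

def conv_max_pool(word_vectors, filt, k=2):
    """Apply one convolution filter to every window of k consecutive
    word vectors, then max-pool: only the strongest activation
    survives, so a strong local clue is detected regardless of
    where it appears in the input."""
    scores = []
    for start in range(len(word_vectors) - k + 1):
        window = [v for vec in word_vectors[start:start + k] for v in vec]
        scores.append(math.tanh(sum(w * x for w, x in zip(filt, window))))
    return max(scores)

# Toy 2-dimensional word vectors, and a filter hand-built to
# respond strongly to the bigram "not good":
vecs = {"movie": [1.0, 0.0], "was": [0.0, 1.0],
        "not": [1.0, 1.0], "good": [0.0, -1.0]}
filt = [1.0, 1.0, 0.0, -1.0]
sent1 = [vecs[w] for w in ["not", "good", "movie"]]
sent2 = [vecs[w] for w in ["movie", "was", "not", "good"]]
# The same clue is detected at the start of one sentence and the
# end of the other, and the pooled score is identical:
print(conv_max_pool(sent1, filt))
print(conv_max_pool(sent2, filt))
```

In a real network the filter weights are of course learned rather than hand-built, and many filters are applied in parallel.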
In natural language we often work with structured data of arbitrary sizes, such as
sequences and trees. We would like to be able to capture regularities in such structures,
or to model similarities between such structures. In many cases, this means encoding
the structure as a ﬁxed width vector, which we can then pass on to another statistical
learner for further processing. While convolutional and pooling architectures allow us to
encode arbitrary large items as ﬁxed size vectors capturing their most salient features,
they do so by sacriﬁcing most of the structural information. Recurrent (Section 10) and
recursive (Section 12) architectures, on the other hand, allow us to work with sequences
and trees while preserving a lot of the structural information. Recurrent networks (Elman,
1990) are designed to model sequences, while recursive networks (Goller & Küchler, 1996)
are generalizations of recurrent networks that can handle trees. We will also discuss an
extension of recurrent networks that allows them to model stacks (Dyer, Ballesteros, Ling,
Matthews, & Smith, 2015; Watanabe & Sumita, 2015).
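As a minimal sketch of how a recurrent network encodes a variable-length sequence into a fixed-size vector, consider an Elman-style recurrence s_t = tanh(x_t W_in + s_{t-1} W_rec + b), with illustrative weights:

```python
import math

def rnn_encode(sequence, W_in, W_rec, b):
    """Elman-style RNN: s_t = tanh(x_t*W_in + s_{t-1}*W_rec + b).
    Returns the final state: a fixed-size encoding of the whole
    sequence, whatever its length."""
    dim = len(b)
    state = [0.0] * dim
    for x in sequence:
        state = [math.tanh(
            sum(xi * W_in[i][j] for i, xi in enumerate(x)) +
            sum(si * W_rec[i][j] for i, si in enumerate(state)) +
            b[j]) for j in range(dim)]
    return state

# Toy weights: 1-dimensional inputs, 2-dimensional state.
W_in = [[0.5, -0.2]]
W_rec = [[0.1, 0.3], [0.0, 0.2]]
b = [0.0, 0.0]
# Sequences of different lengths map to vectors of the same size:
print(len(rnn_encode([[1.0], [0.5]], W_in, W_rec, b)))  # 2
print(len(rnn_encode([[1.0]] * 7, W_in, W_rec, b)))     # 2
```

Unlike the convolution-and-pooling encoder, the state is updated in order, so the encoding is sensitive to where in the sequence each item occurred.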
Recurrent models have been shown to produce very strong results for language modeling
(Mikolov, Karafiát, Burget, Černocký, & Khudanpur, 2010; Mikolov, Kombrink, Lukáš
Burget, Černocký, & Khudanpur, 2011; Mikolov, 2012; Duh, Neubig, Sudoh, & Tsukada,
2013; Adel, Vu, & Schultz, 2013; Auli, Galley, Quirk, & Zweig, 2013; Auli & Gao, 2014),
as well as for sequence tagging (Irsoy & Cardie, 2014; Xu, Auli, & Clark, 2015;
Ling, Dyer, Black, Trancoso, Fermandez, Amir, Marujo, & Luis, 2015b), machine transla-
tion (Sundermeyer, Alkhouli, Wuebker, & Ney, 2014; Tamura, Watanabe, & Sumita, 2014;
Sutskever, Vinyals, & Le, 2014; Cho, van Merrienboer, Gulcehre, Bahdanau, Bougares,
Schwenk, & Bengio, 2014b), dependency parsing (Dyer et al., 2015; Watanabe & Sumita,
2015), sentiment analysis (Wang, Liu, Sun, Wang, & Wang, 2015b), noisy text normalization
(Chrupala, 2014), dialog state tracking (Mrkšić, Ó Séaghdha, Thomson, Gasic, Su,
Vandyke, Wen, & Young, 2015), response generation (Sordoni, Galley, Auli, Brockett, Ji,
Mitchell, Nie, Gao, & Dolan, 2015), and modeling the relation between character sequences
and part-of-speech tags (Ling et al., 2015b).
Recursive models were shown to produce state-of-the-art or near state-of-the-art results
for constituency (Socher, Bauer, Manning, & Ng, 2013) and dependency (Le
& Zuidema, 2014; Zhu, Qiu, Chen, & Huang, 2015a) parse re-ranking, discourse parsing
(Li, Li, & Hovy, 2014), semantic relation classification (Hashimoto, Miwa, Tsuruoka, &
Chikayama, 2013; Liu, Wei, Li, Ji, Zhou, & Wang, 2015), political ideology detection
based on parse trees (Iyyer, Enns, Boyd-Graber, & Resnik, 2014b), sentiment classiﬁcation
(Socher, Perelygin, Wu, Chuang, Manning, Ng, & Potts, 2013; Hermann & Blunsom, 2013),
target-dependent sentiment classiﬁcation (Dong, Wei, Tan, Tang, Zhou, & Xu, 2014) and
question answering (Iyyer, Boyd-Graber, Claudino, Socher, & Daumé III, 2014a).
3. Feature Representation
Before discussing the network structure in more depth, it is important to pay attention
to how features are represented. For now, we can think of a feed-forward neural network
as a function NN(x) that takes as input a d_in-dimensional vector x and produces a
d_out-dimensional output vector. The function is often used as a classifier, assigning the input
x a degree of membership in one or more of d_out classes. The function can be complex,
and is almost always non-linear. Common structures of this function will be discussed
in Section 4. Here, we focus on the input, x. When dealing with natural language, the
input x encodes features such as words, part-of-speech tags or other linguistic information.
Perhaps the biggest jump when moving from sparse-input linear models to neural-network
based models is to stop representing each feature as a unique dimension (the so-called
one-hot representation) and to represent each feature instead as a dense vector. That is, each core
feature is embedded into a d-dimensional space, and represented as a vector in that space.
The embeddings (the vector representation of each core feature) can then be trained like
the other parameters of the function NN. Figure 1 shows the two approaches to feature
representation.
The feature embeddings (the values of the vector entries for each feature) are treated
as model parameters that need to be trained together with the other components of the
network. Methods of training (or obtaining) the feature embeddings will be discussed later.
For now, consider the feature embeddings as given.
The general structure for an NLP classiﬁcation system based on a feed-forward neural
network is thus:
1. Extract a set of core linguistic features f_1, ..., f_k that are relevant for predicting the output.
2. For each feature f_i of interest, retrieve the corresponding vector v(f_i).
3. Combine the vectors (either by concatenation, summation or a combination of both) into an input vector x.
4. Feed x into a non-linear classifier (feed-forward neural network).
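The four steps above can be sketched as follows (the embedding values, feature names and classifier weights are all made up; in a real system the embeddings and weights are trained parameters):

```python
import math

# Steps 1-2: a lookup table mapping each core feature to its
# embedding vector. Random-looking values stand in for trained ones;
# note that word and tag features may live in different dimensions.
embeddings = {
    "w=dog": [0.2, -0.1, 0.4],
    "w=the": [0.0, 0.3, -0.2],
    "t=DET": [0.5, 0.5],
}

def encode(features):
    """Steps 2-3: look up each feature's vector and concatenate."""
    x = []
    for f in features:
        x.extend(embeddings[f])
    return x

def classify(x, W, b):
    """Step 4: feed x into a (here, single-layer) non-linear classifier."""
    return [math.tanh(sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j])
            for j in range(len(b))]

x = encode(["w=dog", "w=the", "t=DET"])
print(len(x))  # 8: the concatenation of a 3-, 3- and 2-dim embedding
W = [[0.1, -0.1]] * 8
b = [0.0, 0.0]
print(classify(x, W, b))
```

Concatenation preserves which feature contributed which entries of x; summation (used when the same feature type appears a variable number of times) would give a fixed-size x at the cost of that distinction.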
The biggest change in the input, then, is the move from sparse representations in which
each feature is its own dimension, to a dense representation in which each feature is mapped
to a vector. Another diﬀerence is that we extract only core features and not feature com-
binations. We will elaborate on both these changes brieﬂy.
Dense Vectors vs. One-hot Representations What are the beneﬁts of representing
our features as vectors instead of as unique IDs? Should we always represent features as
dense vectors? Let’s consider the two kinds of representations:
One Hot
• Each feature is its own dimension.
• Dimensionality of the one-hot vector is the same as the number of distinct features.
1. Diﬀerent feature types may be embedded into diﬀerent spaces. For example, one may represent word
features using 100 dimensions, and part-of-speech features using 20 dimensions.
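A toy comparison of the two representations (the dense values are hand-picked for illustration, not trained):

```python
# One-hot: dimensionality equals the number of distinct features,
# and every pair of distinct features is equally unrelated.
# Dense: a small fixed dimensionality chosen independently of the
# vocabulary size, in which similar features can get similar vectors.
vocab = ["dog", "cat", "table"]

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]

dense = {"dog": [0.9, 0.1], "cat": [0.8, 0.2], "table": [-0.7, 0.6]}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Under one-hot, "dog" is as unrelated to "cat" as to "table":
print(dot(one_hot("dog"), one_hot("cat")))    # 0.0
print(dot(one_hot("dog"), one_hot("table")))  # 0.0
# Dense vectors can encode that "dog" and "cat" are similar:
print(round(dot(dense["dog"], dense["cat"]), 2))    # 0.74
print(round(dot(dense["dog"], dense["table"]), 2))  # -0.57
```

This sharing of statistical strength between similar features is a central benefit of the dense representation.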