Convolutional Neural Networks for Sentence Classification
Yoon Kim
New York University
yhk255@nyu.edu
Abstract
We report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static vectors. The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification.
1 Introduction
Deep learning models have achieved remarkable results in computer vision (Krizhevsky et al., 2012) and speech recognition (Graves et al., 2013) in recent years. Within natural language processing, much of the work with deep learning methods has involved learning word vector representations through neural language models (Bengio et al., 2003; Yih et al., 2011; Mikolov et al., 2013) and performing composition over the learned word vectors for classification (Collobert et al., 2011). Word vectors, wherein words are projected from a sparse, 1-of-$V$ encoding (here $V$ is the vocabulary size) onto a lower dimensional vector space via a hidden layer, are essentially feature extractors that encode semantic features of words in their dimensions. In such dense representations, semantically close words are likewise close (in Euclidean or cosine distance) in the lower dimensional vector space.
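As a minimal illustration of this property, the sketch below (with made-up 4-dimensional vectors; real embeddings are typically hundreds of dimensions) contrasts cosine similarity under dense vectors with the sparse 1-of-$V$ encoding, where every pair of distinct words is orthogonal:

```python
# Toy illustration: dense word vectors support meaningful similarity
# comparisons; 1-of-V encodings do not. All vectors here are made up.
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical dense embeddings: semantically close words get nearby vectors.
vec = {
    "good":  np.array([0.9, 0.1, 0.3, 0.0]),
    "great": np.array([0.8, 0.2, 0.4, 0.1]),
    "table": np.array([0.0, 0.9, 0.0, 0.7]),
}
print(cosine(vec["good"], vec["great"]))  # high (~0.98)
print(cosine(vec["good"], vec["table"]))  # low (~0.08)

# Under a 1-of-V encoding, every pair of distinct words is orthogonal,
# so cosine similarity is 0 regardless of meaning.
```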
Convolutional neural networks (CNN) utilize layers with convolving filters that are applied to local features (LeCun et al., 1998). Originally invented for computer vision, CNN models have subsequently been shown to be effective for NLP and have achieved excellent results in semantic parsing (Yih et al., 2014), search query retrieval (Shen et al., 2014), sentence modeling (Kalchbrenner et al., 2014), and other traditional NLP tasks (Collobert et al., 2011).
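To make the convolution over local features concrete, the following minimal numpy sketch (with random values standing in for real word vectors and learned parameters) applies a single filter to every window of h consecutive word vectors, producing one feature map:

```python
# A single convolutional filter applied to local features of a sentence:
# each window of h consecutive word vectors (dimension k) maps to one scalar.
# Values are toy placeholders, not trained parameters.
import numpy as np

n, k, h = 7, 5, 3                      # sentence length, vector dim, window size
rng = np.random.default_rng(0)
sentence = rng.normal(size=(n, k))     # stand-in for stacked word vectors
w = rng.normal(size=(h, k))            # filter weights
b = 0.1                                # bias term

# Slide the filter over every window of h words to build the feature map.
feature_map = np.array([
    np.tanh(np.sum(w * sentence[i:i + h]) + b)   # nonlinearity per window
    for i in range(n - h + 1)
])
print(feature_map.shape)               # (n - h + 1,)
```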
In the present work, we train a simple CNN with one layer of convolution on top of word vectors obtained from an unsupervised neural language model. These vectors were trained by Mikolov et al. (2013) on 100 billion words of Google News, and are publicly available.¹ We initially keep the word vectors static and learn only the other parameters of the model. Despite little tuning of hyperparameters, this simple model achieves excellent results on multiple benchmarks, suggesting that the pre-trained vectors are ‘universal’ feature extractors that can be utilized for various classification tasks. Learning task-specific vectors through fine-tuning results in further improvements. We finally describe a simple modification to the architecture to allow for the use of both pre-trained and task-specific vectors by having multiple channels.
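The three embedding settings just described (static, fine-tuned, and multichannel) can be made concrete with a short sketch. The following uses PyTorch purely for illustration, not the paper's original implementation, and a random matrix stands in for the pre-trained word2vec weights:

```python
# Illustrative sketch of static, non-static (fine-tuned), and two-channel
# embedding settings. `pretrained` is a random placeholder for word2vec.
import torch
import torch.nn as nn

vocab_size, k = 10000, 300
pretrained = torch.randn(vocab_size, k)   # placeholder for word2vec weights

static_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)      # kept fixed
nonstatic_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)  # fine-tuned

tokens = torch.tensor([[2, 5, 7, 1]])     # toy word indices for one sentence
# Multichannel: stack both views of the sentence as two input channels,
# so each filter sees every word through both the static and the
# fine-tuned vectors.
channels = torch.stack([static_emb(tokens), nonstatic_emb(tokens)], dim=1)
print(channels.shape)                     # (1, 2, sentence_len, k)
```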
Our work is philosophically similar to Razavian et al. (2014), which showed that for image classification, feature extractors obtained from a pre-trained deep learning model perform well on a variety of tasks, including tasks that are very different from the original task for which the feature extractors were trained.
2 Model
The model architecture, shown in Figure 1, is a slight variant of the CNN architecture of Collobert et al. (2011). Let $\mathbf{x}_i \in \mathbb{R}^k$ be the $k$-dimensional word vector corresponding to the $i$-th word in the sentence. A sentence of length $n$ (padded where
¹ https://code.google.com/p/word2vec/