provements for semantic compositionality include
matrix-vector interaction (Socher et al., 2012)
and tensor interaction (Socher et al., 2013). They are
more suitable for capturing logical information in
sentences, such as negation and exclamation.
One potential problem of RNNs is that the long
propagation paths—through which leaf nodes are
connected to the output layer—may lead to infor-
mation loss. Thus, RNNs bury illuminating in-
formation under a complicated neural architecture.
Further, during back-propagation over a long path,
gradients tend to vanish (or blow up), which makes
training difficult (Erhan et al., 2009). Long short-term memory (LSTM), first proposed for modeling time-series data (Hochreiter and Schmidhuber, 1997), has been integrated into RNNs to alleviate this problem (Tai et al., 2015; Le and Zuidema, 2015; Zhu et al., 2015).
Recurrent networks. A variant class of RNNs
is the recurrent neural network (Bengio et al.,
1994; Shang et al., 2015), whose architecture is
a rightmost tree. In such models, meaningful tree
structures are also lost, similar to CNNs.
3 Tree-based Convolution
This section introduces the proposed tree-based
convolutional neural networks (TBCNNs). Figure
1c depicts the convolution process on a tree.
First, a sentence is converted to a parse tree, ei-
ther a constituency or dependency tree. The corre-
sponding model variants are denoted as c-TBCNN
and d-TBCNN. Each node in the tree is repre-
sented as a distributed, real-valued vector.
Then, we design a set of fixed-depth subtree fea-
ture detectors, called the tree-based convolution
window. The window slides over the entire tree
to extract structural information of the sentence,
illustrated by a dashed triangle in Figure 1c. Formally, let us assume we have $t$ nodes in the convolution window, $\mathbf{x}_1, \cdots, \mathbf{x}_t$, each represented as an $n_e$-dimensional vector. Let $n_c$ be the number of feature detectors. The output of the tree-based convolution window, evaluated at the current subtree, is given by the following generic equation.
$$\mathbf{y} = f\!\left(\sum_{i=1}^{t} W_i \cdot \mathbf{x}_i + \mathbf{b}\right) \qquad (2)$$
where $W_i \in \mathbb{R}^{n_c \times n_e}$ is the weight parameter associated with node $\mathbf{x}_i$; $\mathbf{b} \in \mathbb{R}^{n_c}$ is the bias term.
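To make Equation 2 concrete, the following minimal NumPy sketch evaluates the convolution window at one subtree; the function name and calling convention are ours for illustration, not the paper's.

    import numpy as np

    def conv_window(node_vecs, weights, bias, f=np.tanh):
        """Evaluate the window of Eq. 2 at one subtree.

        node_vecs: list of t node vectors x_1..x_t, each of shape (n_e,)
        weights:   list of t matrices W_1..W_t, each of shape (n_c, n_e)
        bias:      bias vector b of shape (n_c,)
        Returns y of shape (n_c,), one value per feature detector.
        """
        z = bias.copy()
        for W_i, x_i in zip(weights, node_vecs):
            z += W_i @ x_i           # W_i . x_i, summed over the t nodes
        return f(z)                  # nonlinearity f, e.g., tanh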
Extracted features are thereafter packed into
one or more fixed-size vectors by max pooling,
that is, the maximum value in each dimension is
taken. Finally, we add a fully connected hidden layer and a softmax output layer.
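The pooling and output stages can be sketched as follows, assuming for simplicity a single global max-pooling slot (the actual pooling heuristics appear in Subsection 3.3); all names here are illustrative.

    import numpy as np

    def pool_and_classify(Y, W_h, b_h, W_o, b_o):
        """Y: (num_windows, n_c) matrix stacking the convolution outputs y
        for all window positions over one sentence's tree."""
        pooled = Y.max(axis=0)               # max pooling: dimension-wise maximum
        hidden = np.tanh(W_h @ pooled + b_h) # fully connected hidden layer
        logits = W_o @ hidden + b_o          # softmax output layer
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()               # class probabilities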
From the designed architecture (Figure 1c), we
see that our TBCNN models allow short propaga-
tion paths between the output layer and any posi-
tion in the tree. Therefore, structural features can be learned effectively.
Several main technical points in tree-based con-
volution include: (1) How can we represent hid-
den nodes as vectors in constituency trees? (2)
How can we determine weights, $W_i$, for dependency trees, where nodes may have different numbers of children? (3) How can we pool features of varying sizes and shapes into fixed-size vectors?
In the rest of this section, we explain model
variants in detail. Particularly, Subsections 3.1 and
3.2 address the first and second problems; Sub-
section 3.3 deals with the third problem by intro-
ducing several pooling heuristics. Subsection 3.4
presents our training objective.
3.1 c-TBCNN
Figure 2a illustrates an example of a constituency tree, where leaf nodes are words in the sentence, and each non-leaf node represents a grammatical constituent, e.g., a noun phrase. Sentences are parsed by the Stanford parser;³ further, constituency trees are binarized for simplicity.
One problem with constituency trees is that non-leaf nodes do not have vector representations analogous to word embeddings. Our strategy is to pretrain the constituency tree with an RNN using Equation 1 (Socher et al., 2011b). After pretraining, the vector representations of nodes are fixed.
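Equation 1 is not repeated here; purely as a rough illustration, and assuming the standard recursive composition of Socher et al. (2011b), a non-leaf vector would be computed bottom-up from its children roughly as below (all names are ours; the paper's actual Equation 1 may differ in detail).

    import numpy as np

    def compose(c_l, c_r, W, b, f=np.tanh):
        """One bottom-up composition step used for pretraining node vectors.
        Assumes the common recursive form p = f(W [c_l; c_r] + b);
        c_l, c_r: child vectors of shape (n_e,); W: (n_e, 2*n_e); b: (n_e,)."""
        return f(W @ np.concatenate([c_l, c_r]) + b)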
We now consider the tree-based convolution
process in c-TBCNN with a two-layer-subtree
convolution window, which operates on a parent
node $p$ and its direct children $c_l$ and $c_r$, their vector representations denoted as $\mathbf{p}$, $\mathbf{c}_l$, and $\mathbf{c}_r$. The convolution equation, specific for c-TBCNN, is
$$\mathbf{y} = f\!\left(W_p^{(c)} \cdot \mathbf{p} + W_l^{(c)} \cdot \mathbf{c}_l + W_r^{(c)} \cdot \mathbf{c}_r + \mathbf{b}^{(c)}\right)$$
where $W_p^{(c)}$, $W_l^{(c)}$, and $W_r^{(c)}$ are weights associated with the parent and its child nodes. Superscript $(c)$ indicates that the weights are for c-TBCNN. For leaf nodes, which do not have children, we set $\mathbf{c}_l$ and $\mathbf{c}_r$ to $\mathbf{0}$.
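A minimal sketch of this two-layer-subtree window follows (names are illustrative; zero vectors stand in for the absent children of leaf nodes, as stated above).

    import numpy as np

    def c_tbcnn_window(p, c_l, c_r, W_p, W_l, W_r, b, f=np.tanh):
        """c-TBCNN convolution over a parent p and its two children.
        p, c_l, c_r: node vectors of shape (n_e,); pass None for missing children.
        W_p, W_l, W_r: weight matrices of shape (n_c, n_e); b: bias of shape (n_c,)."""
        if c_l is None:                  # leaf node: left child set to 0
            c_l = np.zeros_like(p)
        if c_r is None:                  # leaf node: right child set to 0
            c_r = np.zeros_like(p)
        return f(W_p @ p + W_l @ c_l + W_r @ c_r + b)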
³ http://nlp.stanford.edu/software/lex-parser.shtml