First-order
[wp]_h, [wp]_d, d(h, d)
[wp]_h, d(h, d)
w_d, p_d, d(h, d)
[wp]_d, d(h, d)
w_h, p_h, w_d, p_d, d(h, d)
p_h, w_h, p_d, d(h, d)
w_h, w_d, p_d, d(h, d)
w_h, p_h, [wp]_d, d(h, d)
p_h, p_b, p_d, d(h, d)
p_h, p_{h+1}, p_{d−1}, p_d, d(h, d)
p_{h−1}, p_h, p_{d−1}, p_d, d(h, d)
p_h, p_{h+1}, p_d, p_{d+1}, d(h, d)
p_{h−1}, p_h, p_d, p_{d+1}, d(h, d)

Second-order
p_h, p_d, p_c, d(h, d, c)
w_h, w_d, w_c, d(h, d, c)
p_h, [wp]_c, d(h, d, c)
p_d, [wp]_c, d(h, d, c)
w_h, [wp]_c, d(h, d, c)
w_d, [wp]_c, d(h, d, c)
[wp]_h, [wp]_{h+1}, [wp]_c, d(h, d, c)
[wp]_{h−1}, [wp]_h, [wp]_c, d(h, d, c)
[wp]_h, [wp]_{c−1}, [wp]_c, d(h, d, c)
[wp]_h, [wp]_c, [wp]_{c+1}, d(h, d, c)
[wp]_{h−1}, [wp]_h, [wp]_{c−1}, [wp]_c, d(h, d, c)
[wp]_h, [wp]_{h+1}, [wp]_{c−1}, [wp]_c, d(h, d, c)
[wp]_{h−1}, [wp]_h, [wp]_c, [wp]_{c+1}, d(h, d, c)
[wp]_h, [wp]_{h+1}, [wp]_c, [wp]_{c+1}, d(h, d, c)
[wp]_d, [wp]_{d+1}, [wp]_c, d(h, d, c)
[wp]_{d−1}, [wp]_d, [wp]_c, d(h, d, c)
[wp]_d, [wp]_{c−1}, [wp]_c, d(h, d, c)
[wp]_d, [wp]_c, [wp]_{c+1}, d(h, d, c)
[wp]_d, [wp]_{d+1}, [wp]_{c−1}, [wp]_c, d(h, d, c)
[wp]_d, [wp]_{d+1}, [wp]_c, [wp]_{c+1}, d(h, d, c)
[wp]_{d−1}, [wp]_d, [wp]_{c−1}, [wp]_c, d(h, d, c)
[wp]_{d−1}, [wp]_d, [wp]_c, [wp]_{c+1}, d(h, d, c)

Table 1: Base feature templates.
The base feature templates are listed in Table 1, where h and d refer to the head and the dependent, respectively; c refers to d's sibling or child; b refers to a word between h and d; +1 (−1) refers to the next (previous) word; w and p refer to the surface word and the part-of-speech tag, respectively; [wp] refers to the surface word or the part-of-speech tag; d(h, d) is the direction of the dependency relation between h and d; and d(h, d, c) is the direction of the relations among h, d, and c.
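To make the templates concrete, the following is a minimal sketch of how a few of the first-order templates could be instantiated as feature strings for a head-dependent pair. The `Token` class, the function name, and the string format are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch: instantiating some first-order templates of
# Table 1. Names and string formats are hypothetical.
from typing import List, NamedTuple

class Token(NamedTuple):
    word: str  # surface word (w)
    pos: str   # part-of-speech tag (p)

def extract_first_order(sent: List[Token], h: int, d: int) -> List[str]:
    """Generate feature strings for head index h and dependent index d."""
    direction = "L" if d < h else "R"  # d(h, d)
    feats = [
        # template: w_h, w_d, p_d, d(h, d)
        f"w_h={sent[h].word}|w_d={sent[d].word}|p_d={sent[d].pos}|{direction}",
        # template: p_h, w_h, p_d, d(h, d)
        f"p_h={sent[h].pos}|w_h={sent[h].word}|p_d={sent[d].pos}|{direction}",
    ]
    # template: p_h, p_b, p_d, d(h, d) -- one feature per word b
    # between h and d
    lo, hi = sorted((h, d))
    for b in range(lo + 1, hi):
        feats.append(
            f"p_h={sent[h].pos}|p_b={sent[b].pos}|p_d={sent[d].pos}|{direction}")
    return feats
```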
We train a parser with the base features and use it as the Baseline parser. Defining f_b(x, g) as the base features and w_b as the corresponding weights, the scoring function becomes

score(x, g) = f_b(x, g) · w_b    (2)
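Because the base features are sparse and binary, the dot product in Eq. (2) reduces to summing the weights of the features that fire. A minimal sketch, assuming a hypothetical feature-to-weight map:

```python
# Sketch of Eq. (2) for sparse binary features: the score is the sum
# of the weights of the active features. `weights` is hypothetical.
from typing import Dict, List

def score(features: List[str], weights: Dict[str, float]) -> float:
    # Unseen features contribute zero weight.
    return sum(weights.get(f, 0.0) for f in features)
```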
3 Feature Embeddings
Our goal is to reduce the sparseness of rich features by learning a distributed representation of features that is dense and low-dimensional. We call this distributed feature representation feature embeddings. In the representation, each dimension represents a hidden class of the features and is expected to capture a type of similarity or shared property among the features.
The key to learning embeddings is to make use of information from a local context, and to this end various methods have been proposed for learning word embeddings. Lin (1997) and Curran (2005) use the counts of words in a surrounding window to represent the distributed meaning of words. Brown et al. (1992) use bigrams to cluster words hierarchically. These methods have been shown to be effective on words. However, the number of features is much larger than the vocabulary size, which makes it infeasible to apply them to features. Another line of research induces word embeddings using neural language models (Bengio, 2008). However, the training of neural language models is too slow for the high dimensionality of features. Mikolov et al. (2013b) and Mikolov et al. (2013a) introduce efficient methods to learn high-quality word embeddings directly from large amounts of unstructured raw text. Since the methods do not involve dense matrix multiplications, training is extremely fast.
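For concreteness, here is a hedged sketch of a single skip-gram update with negative sampling in the style of Mikolov et al. (2013b); all names, hyperparameters, and simplifications (e.g., negatives drawn uniformly rather than from a unigram distribution) are illustrative assumptions, not the paper's implementation. Note that each step touches only a handful of embedding rows, which is why no dense matrix multiplication is needed.

```python
# Sketch of one skip-gram step with negative sampling (SGNS).
# W_in and W_out are the input/output embedding matrices.
import numpy as np

def sgns_step(center: int, context: int,
              W_in: np.ndarray, W_out: np.ndarray,
              rng: np.random.Generator,
              k: int = 5, lr: float = 0.025) -> None:
    """Update embeddings in place for one (center, context) pair."""
    v = W_in[center]
    # One observed (positive) target plus k random negatives.
    targets = [context] + list(rng.integers(0, W_out.shape[0], size=k))
    labels = [1.0] + [0.0] * k
    grad_v = np.zeros_like(v)
    for t, y in zip(targets, labels):
        u = W_out[t]
        p = 1.0 / (1.0 + np.exp(-(v @ u)))  # sigmoid(v . u)
        g = lr * (y - p)                    # gradient of log-loss
        grad_v += g * u
        W_out[t] += g * v
    W_in[center] += grad_v
```

In the adaptation described next, the role of the word pairs would be played by feature pairs extracted from parsed trees.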
We adapt the models of Mikolov et al. (2013b) and Mikolov et al. (2013a) to learn feature embeddings from large amounts of automatically parsed dependency trees. Since learning feature embeddings has a high computational cost, we also use the negative sampling technique at the learning stage (Mikolov et al., 2013b). Unlike word embeddings, the input of our approach is features rather than words, and the feature representations are generated from tree structures instead of word sequences. Consequently,