Dependency-Based Word Embeddings
Omer Levy∗ and Yoav Goldberg
Computer Science Department
Bar-Ilan University
Ramat-Gan, Israel
{omerlevy,yoav.goldberg}@gmail.com

∗ Supported by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287923 (EXCITEMENT).
Abstract
While continuous word embeddings are gaining popularity, current models are based solely on linear contexts. In this work, we generalize the skip-gram model with negative sampling introduced by Mikolov et al. to include arbitrary contexts. In particular, we perform experiments with dependency-based contexts, and show that they produce markedly different embeddings. The dependency-based embeddings are less topical and exhibit more functional similarity than the original skip-gram embeddings.
1 Introduction
Word representation is central to natural language processing. The default approach of representing words as discrete and distinct symbols is insufficient for many tasks, and suffers from poor generalization. For example, the symbolic representations of the words “pizza” and “hamburger” are completely unrelated: even if we know that the word “pizza” is a good argument for the verb “eat”, we cannot infer that “hamburger” is also a good argument. We thus seek a representation that captures semantic and syntactic similarities between words. A very common paradigm for acquiring such representations is based on the distributional hypothesis of Harris (1954), which states that words in similar contexts have similar meanings.
Based on the distributional hypothesis, many methods of deriving word representations have been explored in the NLP community. On one end of the spectrum, words are grouped into clusters based on their contexts (Brown et al., 1992; Uszkoreit and Brants, 2008). On the other end, words are represented as very high-dimensional but sparse vectors in which each entry is a measure of the association between the word and a particular context (see (Turney and Pantel, 2010; Baroni and Lenci, 2010) for a comprehensive survey). In some works, the dimensionality of the sparse word-context vectors is reduced, using techniques such as SVD (Bullinaria and Levy, 2007) or LDA (Ritter et al., 2010; Séaghdha, 2010; Cohen et al., 2012). Most recently, it has been proposed to represent words as dense vectors that are derived by various training methods inspired by neural-network language modeling (Bengio et al., 2003; Collobert and Weston, 2008; Mnih and Hinton, 2008; Mikolov et al., 2011; Mikolov et al., 2013b). These representations, referred to as “neural embeddings” or “word embeddings”, have been shown to perform well across a variety of tasks (Turian et al., 2010; Collobert et al., 2011; Socher et al., 2011; Al-Rfou et al., 2013).
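To make the sparse vector-space end of this spectrum concrete, the following is a minimal sketch (not from the paper) of collecting word-context co-occurrence counts and re-weighting them with positive PMI, one common association measure in the surveys cited above; the toy corpus, window size, and variable names are illustrative assumptions only.

    from collections import Counter
    from math import log

    # Toy corpus; in practice this would be a large tokenized corpus.
    corpus = [["we", "eat", "pizza"], ["we", "eat", "hamburger"]]

    window = 2  # linear context window size (illustrative choice)
    pair_counts, word_counts, ctx_counts = Counter(), Counter(), Counter()

    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i == j:
                    continue
                c = sent[j]
                pair_counts[(w, c)] += 1
                word_counts[w] += 1
                ctx_counts[c] += 1

    total = sum(pair_counts.values())

    def ppmi(w, c):
        """Positive pointwise mutual information of an observed word-context pair."""
        p_wc = pair_counts[(w, c)] / total
        p_w = word_counts[w] / total
        p_c = ctx_counts[c] / total
        return max(0.0, log(p_wc / (p_w * p_c)))

    # Each word is represented by a sparse vector over the contexts it occurs with.
    pizza_vec = {c: ppmi("pizza", c) for (w, c) in pair_counts if w == "pizza"}
    print(pizza_vec)

Dimensionality-reduction methods such as the SVD mentioned above would then be applied to the resulting sparse word-context matrix.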
Word embeddings are easy to work with because they enable efficient computation of word similarities through low-dimensional matrix operations. Among the state-of-the-art word-embedding methods is the skip-gram with negative sampling model (SKIPGRAM), introduced by Mikolov et al. (2013b) and implemented in the word2vec software (code.google.com/p/word2vec/). Not only does it produce useful word representations, but it is also very efficient to train, works in an online fashion, and scales well to huge corpora (billions of words) as well as very large word and context vocabularies.
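As a concrete illustration (not part of the original paper), a SKIPGRAM-style model can be trained with the gensim implementation of word2vec; the corpus, hyperparameter values, and query word below are placeholders, and the parameter names assume gensim 4.x.

    from gensim.models import Word2Vec

    # Placeholder corpus: an iterable of tokenized sentences.
    sentences = [
        ["we", "eat", "pizza", "with", "cheese"],
        ["we", "eat", "hamburger", "with", "fries"],
    ]

    # sg=1 selects the skip-gram architecture; negative=15 enables
    # negative sampling with 15 noise samples per observed pair.
    model = Word2Vec(
        sentences=sentences,
        vector_size=100,   # dimensionality of the dense embeddings
        window=5,          # linear context window (k tokens to each side)
        sg=1,
        negative=15,
        min_count=1,       # keep all words in this tiny toy corpus
        workers=1,
    )

    # Word similarity reduces to cosine similarity between dense vectors.
    print(model.wv.most_similar("pizza", topn=3))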
Previous work on neural word embeddings takes the contexts of a word to be its linear context – words that precede and follow the target word, typically in a window of k tokens to each side. However, other types of contexts can be explored too.
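For illustration (a sketch not taken from the paper), linear bag-of-words contexts with a window of k tokens might be enumerated as follows; the example sentence and the value of k are placeholders.

    def linear_contexts(tokens, k=2):
        """Yield (target word, context word) pairs from a window of k tokens."""
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - k), min(len(tokens), i + k + 1)
            for j in range(lo, hi):
                if j != i:
                    yield word, tokens[j]

    sentence = ["australian", "scientist", "discovers", "star", "with", "telescope"]
    for word, context in linear_contexts(sentence, k=2):
        print(word, "->", context)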
In this work, we generalize the SKIPGRAM model, and move from linear bag-of-words contexts to arbitrary word contexts. Specifically,