Analogies Explained: Towards Understanding Word Embeddings
Carl Allen^1   Timothy Hospedales^1
Abstract

Word embeddings generated by neural network methods such as word2vec (W2V) are well known to exhibit seemingly linear behaviour, e.g. the embeddings of analogy "woman is to queen as man is to king" approximately describe a parallelogram. This property is particularly intriguing since the embeddings are not trained to achieve it. Several explanations have been proposed, but each introduces assumptions that do not hold in practice. We derive a probabilistically grounded definition of paraphrasing that we re-interpret as word transformation, a mathematical description of "$w_x$ is to $w_y$". From these concepts we prove existence of linear relationships between W2V-type embeddings that underlie the analogical phenomenon, identifying explicit error terms.
1. Introduction
The vector representation, or embedding, of words underpins much of modern machine learning for natural language processing (e.g. Turney & Pantel (2010)). Where, previously, embeddings were generated explicitly from word statistics, neural network methods are now commonly used to generate neural embeddings that are of low dimension relative to the number of words represented, yet achieve impressive performance on downstream tasks (e.g. Turian et al. (2010); Socher et al. (2013)). Of these, word2vec^2 (W2V) (Mikolov et al., 2013a) and Glove (Pennington et al., 2014) are amongst the best known, and are those on which we focus.
Interestingly, such embeddings exhibit seemingly linear behaviour (Mikolov et al., 2013b; Levy & Goldberg, 2014a), e.g. the respective embeddings of analogies, or word relationships of the form "$w_a$ is to $w_{a^*}$ as $w_b$ is to $w_{b^*}$", often satisfy $w_{a^*} - w_a + w_b \approx w_{b^*}$, where $w_i$ is the embedding of word $w_i$. This enables analogical questions such as "man is to king as woman is to ..?" to be solved by vector addition and subtraction. Such high order structure is surprising since word embeddings are trained using only pairwise word co-occurrence data extracted from a text corpus.

^1 School of Informatics, University of Edinburgh. Correspondence to: Carl Allen <carl.allen@ed.ac.uk>.

^2 Throughout, we refer to the more commonly used Skipgram implementation of W2V with negative sampling (SGNS).

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).
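For illustration, a minimal sketch of answering such a question by vector arithmetic: the nearest embedding (by cosine similarity) to $w_{king} - w_{man} + w_{woman}$ is taken as the answer. The tiny hand-made vectors below are invented purely for the example and are not trained W2V embeddings.

```python
import numpy as np

# Toy, hand-made embeddings for illustration only (not trained vectors),
# constructed so that king - man + woman lands on queen.
emb = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
}

def solve_analogy(a, a_star, b, emb):
    """Return the word maximising cosine similarity to w_{a*} - w_a + w_b,
    excluding the three query words (the standard evaluation convention)."""
    target = emb[a_star] - emb[a] + emb[b]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = {w: v for w, v in emb.items() if w not in {a, a_star, b}}
    return max(candidates, key=lambda w: cos(candidates[w], target))

# "man is to king as woman is to ..?"
print(solve_analogy("man", "king", "woman", emb))  # -> queen
```

Excluding the query words themselves is the usual convention when evaluating analogies, since the unmodified vector $w_{a^*}$ is often the numerical nearest neighbour of the target.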
We first show that where embeddings factorise pointwise mutual information (PMI), it is paraphrasing that determines when a linear combination of embeddings equates to that of another word. We say king paraphrases man and royal, for example, if there is a semantic equivalence between king and {man, royal} combined. We can measure such equivalence with respect to probability distributions over nearby words, in line with Firth's maxim "You shall know a word by the company it keeps" (Firth, 1957). We then show that paraphrasing can be reinterpreted as word transformation with additive parameters (e.g. from man to king by adding royal) and generalise to also allow subtraction. Finally, we prove that by interpreting an analogy "$w_a$ is to $w_{a^*}$ as $w_b$ is to $w_{b^*}$" as word transformations $w_a$ to $w_{a^*}$ and $w_b$ to $w_{b^*}$ sharing the same parameters, the linear relationship observed between word embeddings of analogies follows (see overview in Fig 4). Our key contributions are:
• to derive a probabilistic definition of paraphrasing and show that it governs the relationship between one (PMI-derived) word embedding and any sum of others;

• to show how paraphrasing can be generalised and interpreted as the transformation from one word to another, giving a mathematical formulation for "$w_x$ is to $w_{x^*}$";

• to provide the first rigorous proof of the linear relationship between word embeddings of analogies, including explicit, interpretable error terms; and

• to show how these relationships materialise between vectors of PMI values, and so too in word embeddings that factorise the PMI matrix, or approximate such a factorisation, e.g. W2V and Glove.
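As a toy illustration of the last point, the sketch below builds a PMI matrix from invented co-occurrence counts and factorises it with a truncated SVD, so that dot products of the resulting vectors approximate PMI values. The counts, the dimension $d$, and the use of SVD are illustrative assumptions for this sketch, not the W2V training procedure.

```python
import numpy as np

# Invented symmetric word-word co-occurrence counts, for illustration only.
counts = np.array([[0., 4., 2.],
                   [4., 0., 3.],
                   [2., 3., 0.]])

total = counts.sum()
p_ij = counts / total                        # joint probabilities p(i, j)
p_i = counts.sum(axis=1) / total             # marginal probabilities p(i)
with np.errstate(divide="ignore"):
    pmi = np.log(p_ij / np.outer(p_i, p_i))  # PMI(i,j) = log p(i,j)/(p(i)p(j))
pmi[np.isinf(pmi)] = 0.0                     # zero out undefined (log 0) entries

# Rank-d factorisation: PMI ≈ W @ C.T, giving word (W) and context (C) vectors
U, S, Vt = np.linalg.svd(pmi)
d = 2
W = U[:, :d] * np.sqrt(S[:d])
C = Vt[:d].T * np.sqrt(S[:d])

err = np.abs(W @ C.T - pmi).max()
print(f"max reconstruction error: {err:.3f}")
```

The point of the sketch is only that embedding dot products can recover PMI values up to a low-rank approximation error; the paper's results apply to embeddings that factorise, or approximately factorise, the PMI matrix in this sense.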
2. Previous Work
Intuition for the presence of linear analogical relationships,
or linguistic regularity, amongst word embeddings was first
suggested by Mikolov et al. (2013a;b) and Pennington et al.
(2014), and has been widely discussed since (e.g. Levy &
Goldberg (2014a); Linzen (2016)). More recently, several
theoretical explanations have been proposed: