One possible choice to define the probability of a
context word is the softmax:
$$p(w_c \mid w_t) = \frac{e^{s(w_t, w_c)}}{\sum_{j=1}^{W} e^{s(w_t, j)}}.$$
However, such a model is not adapted to our case as it implies that, given a word $w_t$, we only predict one context word $w_c$.
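For concreteness, here is a small sketch (in Python, with names of our own choosing) of how such a softmax probability would be computed from the scores $s(w_t, j)$; it is purely illustrative, since, as noted above, this is not the formulation we adopt.

```python
import numpy as np

def softmax_probability(scores, c):
    """Probability of the context word with index c, given
    scores[j] = s(w_t, j) for every word j in the vocabulary."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()      # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores[c] / exp_scores.sum()

# toy vocabulary of four words with scores s(w_t, j)
print(softmax_probability([2.0, 0.5, -1.0, 0.0], c=0))
```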
The problem of predicting context words can instead be framed as a set of independent binary classification tasks. Then the goal is to independently predict the presence (or absence) of context words. For the word at position $t$ we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position $c$, using the binary logistic loss, we obtain the following negative log-likelihood:
$$\log\left(1 + e^{-s(w_t, w_c)}\right) + \sum_{n \in \mathcal{N}_{t,c}} \log\left(1 + e^{s(w_t, n)}\right),$$
where $\mathcal{N}_{t,c}$ is a set of negative examples sampled from the vocabulary. By denoting the logistic loss function $\ell : x \mapsto \log(1 + e^{-x})$, we can re-write the objective as:
$$\sum_{t=1}^{T} \left[ \sum_{c \in \mathcal{C}_t} \ell\big(s(w_t, w_c)\big) + \sum_{n \in \mathcal{N}_{t,c}} \ell\big(-s(w_t, n)\big) \right].$$
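As a rough illustration (the function and variable names below are ours, the scoring function $s$ is kept abstract and passed in as an argument, and negatives are drawn uniformly here rather than with the frequency-based sampling typically used in practice), this objective could be computed as follows:

```python
import math
import random

def logistic_loss(x):
    """Logistic loss l(x) = log(1 + exp(-x)), computed in a numerically
    stable way for large |x|."""
    return math.log1p(math.exp(-abs(x))) + max(-x, 0.0)

def skipgram_ns_objective(corpus, contexts, score, vocab, k=5):
    """Negative-sampling objective summed over the whole corpus.

    corpus[t]   -- word at position t
    contexts[t] -- list of context words C_t for position t
    score(w, c) -- the scoring function s, kept abstract here
    vocab       -- list of words to draw negatives from (uniformly here)
    k           -- number of negative samples per (t, c) pair
    """
    total = 0.0
    for t, w_t in enumerate(corpus):
        for w_c in contexts[t]:
            total += logistic_loss(score(w_t, w_c))            # positive pair
            negatives = random.sample(vocab, k)                # the set N_{t,c}
            total += sum(logistic_loss(-score(w_t, n)) for n in negatives)
    return total
```

With the scalar-product parameterization introduced next, `score` would simply be the dot product between the corresponding word and context vectors.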
A natural parameterization for the scoring function $s$ between a word $w_t$ and a context word $w_c$ is to use word vectors. Let us define for each word $w$ in the vocabulary two vectors $u_w$ and $v_w$ in $\mathbb{R}^d$. These two vectors are sometimes referred to as input and output vectors in the literature. In particular, we have vectors $u_{w_t}$ and $v_{w_c}$, corresponding, respectively, to words $w_t$ and $w_c$. The score is then computed as the scalar product between word and context vectors: $s(w_t, w_c) = u_{w_t}^{\top} v_{w_c}$. The model described in this section is the skipgram model with negative sampling, introduced by Mikolov et al. (2013b).
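A minimal sketch of this parameterization, assuming hypothetical lookup tables u_vecs and v_vecs from words to their input and output vectors:

```python
import numpy as np

d = 4  # embedding dimension (illustrative)

# hypothetical lookup tables from words to their input (u) and output (v) vectors
rng = np.random.default_rng(0)
u_vecs = {"cat": rng.standard_normal(d), "sat": rng.standard_normal(d)}
v_vecs = {"cat": rng.standard_normal(d), "sat": rng.standard_normal(d)}

def score(w_t, w_c):
    """s(w_t, w_c) = u_{w_t}^T v_{w_c}: scalar product between the input
    vector of the word and the output vector of the context word."""
    return float(np.dot(u_vecs[w_t], v_vecs[w_c]))

print(score("cat", "sat"))
```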
3.2 Subword model
By using a distinct vector representation for each word, the skipgram model ignores the internal structure of words. In this section, we propose a different scoring function $s$ that takes this information into account.
Each word $w$ is represented as a bag of character n-grams. We add special boundary symbols < and > at the beginning and end of words, which allows us to distinguish prefixes and suffixes from other character sequences. We also include the word $w$ itself in the set of its n-grams, to learn a representation for each word (in addition to character n-grams). Taking the word where and n = 3 as an example, it will be represented by the character n-grams:

<wh, whe, her, ere, re>

and the special sequence

<where>.

Note that the sequence <her>, corresponding to the word her, is different from the tri-gram her from the word where. In practice, we extract all the n-grams for n greater than or equal to 3 and smaller than or equal to 6. This is a very simple approach, and different sets of n-grams could be considered, for example taking all prefixes and suffixes.
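The following sketch (the function name char_ngrams is ours) illustrates this extraction, with boundary symbols, the full word included as a special sequence, and n ranging from 3 to 6 by default:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Bag of character n-grams of a word, with boundary symbols added
    and the full word itself included as a special sequence."""
    bounded = "<" + word + ">"
    ngrams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(bounded) - n + 1):
            ngrams.add(bounded[i:i + n])
    ngrams.add(bounded)  # the special sequence, e.g. "<where>"
    return ngrams

# restricted to n = 3, "where" yields <wh, whe, her, ere, re> plus <where>
print(sorted(char_ngrams("where", n_min=3, n_max=3)))
```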
Suppose that you are given a dictionary of n-grams of size $G$. Given a word $w$, let us denote by $\mathcal{G}_w \subset \{1, \dots, G\}$ the set of n-grams appearing in $w$. We associate a vector representation $z_g$ to each n-gram $g$. We represent a word by the sum of the vector representations of its n-grams. We thus obtain the scoring function:

$$s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^{\top} v_c.$$
This simple model allows sharing the representations across words, thus allowing reliable representations to be learned for rare words.
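As an illustrative sketch (assuming hypothetical dictionaries z_vecs and v_vecs mapping n-grams and context words to their vectors), this scoring function could be computed as:

```python
import numpy as np

def subword_score(ngrams_of_w, context, z_vecs, v_vecs):
    """s(w, c) = sum over g in G_w of z_g^T v_c: the word is represented
    by the sum of the vector representations of its n-grams."""
    word_vec = np.sum([z_vecs[g] for g in ngrams_of_w], axis=0)
    return float(np.dot(word_vec, v_vecs[context]))

# toy example in dimension 3
z_vecs = {"<wh": np.ones(3), "whe": np.ones(3)}   # n-gram vectors z_g
v_vecs = {"cat": np.full(3, 0.5)}                 # context vectors v_c
print(subword_score({"<wh", "whe"}, "cat", z_vecs, v_vecs))  # 3.0
```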
In order to bound the memory requirements of our model, we use a hashing function that maps n-grams to integers in 1 to K. We hash character sequences using the Fowler-Noll-Vo hashing function (specifically the FNV-1a variant).[1] We set $K = 2 \times 10^6$ below. Ultimately, a word is represented by its index in the word dictionary and the set of hashed n-grams it contains.
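A sketch of this bucketing step is given below; the constants are the published 32-bit FNV-1a parameters, the bucket indices are 0-based for convenience, and the exact details of the reference implementation may differ:

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash of a byte string."""
    h = 2166136261                       # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # multiply by the FNV prime, keep 32 bits
    return h

K = 2_000_000  # number of hash buckets, K = 2 * 10^6

def ngram_bucket(ngram: str) -> int:
    """Map a character n-gram to a bucket index in {0, ..., K-1}."""
    return fnv1a_32(ngram.encode("utf-8")) % K

print(ngram_bucket("<wh"))
```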
4 Experimental setup
4.1 Baseline
In most experiments (except in Sec. 5.3), we
compare our model to the C implementation
[1] http://www.isthe.com/chongo/tech/comp/fnv