than a few hundred million words, with a modest dimensionality of the word vectors between
50 and 100.
We use recently proposed techniques for measuring the quality of the resulting vector representa-
tions, with the expectation that not only will similar words tend to be close to each other, but that
words can have multiple degrees of similarity [20]. This has been observed earlier in the context
of inflectional languages: for example, nouns can have multiple word endings, and if we search for
similar words in a subspace of the original vector space, it is possible to find words with similar
endings [13, 14].
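As a concrete illustration of how such similarities are typically probed, the sketch below finds the nearest neighbours of a word by cosine similarity between word vectors. The vocabulary, embedding table, and the nearest_neighbors helper are illustrative assumptions introduced here, not part of the models described in this paper.

import numpy as np

def nearest_neighbors(word, vocab, embeddings, k=5):
    # Normalize all vectors so that a dot product equals cosine similarity.
    vecs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    idx = vocab.index(word)
    sims = vecs @ vecs[idx]                      # cosine similarity to every word
    order = np.argsort(-sims)                    # most similar first
    return [vocab[i] for i in order if i != idx][:k]

# Toy vocabulary and 3-dimensional embedding table (purely illustrative).
vocab = ["king", "queen", "man", "woman"]
embeddings = np.array([[0.9, 0.1, 0.4],
                       [0.8, 0.2, 0.5],
                       [0.3, 0.9, 0.1],
                       [0.2, 0.8, 0.2]])
print(nearest_neighbors("king", vocab, embeddings, k=2))   # -> ['queen', 'man']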
Somewhat surprisingly, it was found that similarity of word representations goes beyond simple
syntactic regularities. Using a word offset technique where simple algebraic operations are per-
formed on the word vectors, it was shown for example that vector("King") - vector("Man") + vec-
tor("Woman") results in a vector that is closest to the vector representation of the word Queen [20].
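The word-offset operation itself can be sketched in a few lines: form the combined vector and return the closest word in the vocabulary, excluding the input words. The toy vocabulary and embedding table below are again illustrative placeholders, not vectors trained by any of the models discussed here.

import numpy as np

def analogy(a, b, c, vocab, embeddings):
    # Answers "a is to b as c is to ?" via vector(b) - vector(a) + vector(c).
    vecs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    target = vecs[vocab.index(b)] - vecs[vocab.index(a)] + vecs[vocab.index(c)]
    sims = vecs @ (target / np.linalg.norm(target))
    exclude = {vocab.index(w) for w in (a, b, c)}            # never return an input word
    best = max((i for i in range(len(vocab)) if i not in exclude), key=lambda i: sims[i])
    return vocab[best]

vocab = ["king", "queen", "man", "woman"]                    # toy data, illustrative only
embeddings = np.array([[0.9, 0.1, 0.4],
                       [0.8, 0.2, 0.5],
                       [0.3, 0.9, 0.1],
                       [0.2, 0.8, 0.2]])
print(analogy("man", "king", "woman", vocab, embeddings))    # -> "queen"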
In this paper, we try to maximize the accuracy of these vector operations by developing new model
architectures that preserve the linear regularities among words. We design a new comprehensive test
set for measuring both syntactic and semantic regularities¹, and show that many such regularities
can be learned with high accuracy. Moreover, we discuss how training time and accuracy depend
on the dimensionality of the word vectors and on the amount of the training data.
1.2 Previous Work
Representation of words as continuous vectors has a long history [10, 26, 8]. A very popular model
architecture for estimating a neural network language model (NNLM) was proposed in [1], where a
feedforward neural network with a linear projection layer and a non-linear hidden layer was used to
jointly learn the word vector representation and a statistical language model. This work has been
followed by many others.
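A minimal sketch may help make this kind of feedforward NNLM concrete: the N-1 previous words are looked up in a shared projection matrix, concatenated, passed through a non-linear hidden layer, and a softmax over the vocabulary predicts the next word. The layer sizes, initialization, and tanh nonlinearity below are illustrative assumptions rather than the exact setup of [1].

import numpy as np

V, D, H, N = 10000, 100, 500, 4     # vocabulary size, vector dim, hidden units, n-gram order (assumed)
rng = np.random.default_rng(0)

C   = rng.normal(scale=0.1, size=(V, D))             # linear projection layer = word-vector table
W_h = rng.normal(scale=0.1, size=((N - 1) * D, H))   # projection-to-hidden weights
b_h = np.zeros(H)
W_o = rng.normal(scale=0.1, size=(H, V))             # hidden-to-output weights
b_o = np.zeros(V)

def forward(context_ids):
    # Distribution over the next word given the N-1 previous word ids.
    x = C[context_ids].reshape(-1)                   # look up and concatenate context vectors
    h = np.tanh(x @ W_h + b_h)                       # non-linear hidden layer
    logits = h @ W_o + b_o
    p = np.exp(logits - logits.max())
    return p / p.sum()                               # softmax over the full vocabulary

probs = forward([12, 7, 945])                        # three context word ids -> V probabilities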
Another interesting NNLM architecture was presented in [13, 14], where the word vectors are
first learned using a neural network with a single hidden layer. The word vectors are then used to train
the NNLM. Thus, the word vectors are learned even without constructing the full NNLM. In this
work, we directly extend this architecture, and focus just on the first step where the word vectors are
learned using a simple model.
It was later shown that the word vectors can be used to significantly improve and simplify many
NLP applications [4, 5, 29]. Estimation of the word vectors itself was performed using different
model architectures and trained on various corpora [4, 29, 23, 19, 9], and some of the resulting word
vectors were made available for future research and comparison². However, as far as we know, these
architectures were significantly more computationally expensive to train than the one proposed
in [13], with the exception of a certain version of the log-bilinear model where diagonal weight
matrices are used [23].
2 Model Architectures
Many different types of models were proposed for estimating continuous representations of words,
including the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).
In this paper, we focus on distributed representations of words learned by neural networks, as it was
previously shown that they perform significantly better than LSA for preserving linear regularities
among words [20, 31]; moreover, LDA becomes computationally very expensive on large data sets.
Similarly to [18], to compare different model architectures we first define the computational complex-
ity of a model as the number of parameters that need to be accessed to fully train the model. Next,
we will try to maximize the accuracy, while minimizing the computational complexity.
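As a back-of-the-envelope illustration of this measure (the sizes are the same assumed values used in the NNLM sketch above, not figures from this paper), one can count how many parameters a single training example of such a feedforward NNLM touches; the hidden-to-output weights clearly dominate.

V, D, H, N = 10000, 100, 500, 4       # assumed sizes: vocabulary, vector dim, hidden units, n-gram order

projection = (N - 1) * D              # context word vectors looked up
hidden     = (N - 1) * D * H          # projection-to-hidden weights
output     = H * V                    # hidden-to-output weights

per_example = projection + hidden + output
print(per_example)                    # 5150300 parameters accessed per training example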
¹ The test set is available at www.fit.vutbr.cz/~imikolov/rnnlm/word-test.v1.txt
² http://ronan.collobert.com/senna/
  http://metaoptimize.com/projects/wordreprs/
  http://www.fit.vutbr.cz/~imikolov/rnnlm/
  http://ai.stanford.edu/~ehhuang/