Measuring Word Significance using Distributed Representations of Words
Adriaan M. J. Schakel
NNLP
adriaan.schakel@gmail.com
Benjamin J. Wilson
Lateral GmbH
benjamin@lateral.io
Abstract
Distributed representations of words as real-valued vectors in a relatively low-dimensional space aim at extracting syntactic and semantic features from large text corpora. A recently introduced neural network, named word2vec (Mikolov et al., 2013a; Mikolov et al., 2013b), was shown to encode semantic information in the direction of the word vectors. In this brief report, it is proposed to use the length of the vectors, together with the term frequency, as a measure of word significance in a corpus. Experimental evidence using a domain-specific corpus of abstracts is presented to support this proposal. A useful visualization technique for text corpora emerges, where words are mapped onto a two-dimensional plane and automatically ranked by significance.
1 Introduction
Discovering the underlying topics or discourses in large text corpora is a challenging task in natural language processing (NLP). A statistical approach often starts by determining the frequency of occurrence of terms across the corpus, and using the term frequency as a criterion for word significance—a thesis put forward in a seminal paper by Luhn (Luhn, 1958). From the list of terms ranked by frequency, terms that are either too rare or too common are usually dropped, for they are of little use. For a domain-specific corpus, the top ranked terms in the trimmed list often nicely summarize the main topics of the corpus, as will be illustrated below.
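As a rough sketch (not taken from the paper), this frequency-based ranking and trimming could be implemented as follows; the cut-off parameters min_count and max_count are hypothetical stand-ins for the "too rare" and "too common" thresholds:

```python
from collections import Counter

def frequency_ranked_terms(documents, min_count=5, max_count=10000):
    """Rank terms by corpus-wide frequency, dropping terms that are
    either too rare or too common to be useful (Luhn, 1958).
    `documents` is an iterable of tokenized documents (lists of strings);
    the cut-off values are illustrative, not prescribed by the paper."""
    counts = Counter(token for doc in documents for token in doc)
    trimmed = [(term, n) for term, n in counts.items()
               if min_count <= n <= max_count]
    # Highest-frequency terms first; for a domain-specific corpus these
    # top-ranked terms often summarize its main topics.
    return sorted(trimmed, key=lambda item: item[1], reverse=True)
```

For a domain-specific corpus of abstracts, printing the top of this trimmed list already gives a crude topical summary of the kind described above.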
For more detailed corpus analysis, such as discovering the subtopics covered by the documents in the corpus, the term frequency list by itself is, however, of limited use. The main problem is that within a given frequency range, function words, which primarily have an organizing function and carry little or no meaning, appear together with content words, which represent central features of texts and carry the meaning of the context. In other words, the rank of a term in the frequency list is by itself not indicative of meaning (Luhn, 1958).
This problem can be tackled by replacing the corpus-wide term frequency with a more refined weighting scheme based on document-specific term frequency (Aizawa, 2000). In such a scheme, a document is taken as the context in which a word appears. Since key words are typically repeated in a document, they tend to cluster and to be less evenly distributed across a text corpus than function words of the same frequency. The fraction of documents containing a given term can then be used to distinguish them. Much more elaborate statistical methods have been developed to further explore the distribution of terms in collections of documents, such as topic modeling (Blei et al., 2003) and spacing statistics (Ortuño et al., 2002).
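A minimal sketch of this idea, assuming pre-tokenized documents and leaving aside the more elaborate schemes cited above, is to compute for each term the fraction of documents in which it occurs:

```python
from collections import Counter

def document_fractions(documents):
    """Return, for each term, the fraction of documents that contain it.
    At equal corpus-wide frequency, content words tend to cluster in fewer
    documents than function words, so a lower fraction hints at a key word."""
    n_docs = len(documents)
    doc_counts = Counter(term for doc in documents for term in set(doc))
    return {term: count / n_docs for term, count in doc_counts.items()}
```

Comparing this fraction among terms of similar corpus frequency is one simple way to separate clustered content words from evenly spread function words.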
An even more refined weighting scheme is obtained by reducing the context of a word from the document in which it appears to a window of just a few words. Such a scheme is suggested by Harris’ distributional hypothesis (Harris, 1954), which states “that it is possible to define a linguistic structure solely in terms of the ‘distributions’ (= patterns of co-occurrences) of its elements”, or as Firth famously put it (Firth, 1957), “a word is characterized by the company it keeps”.
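As an illustration of such a window-based context (a sketch assuming a symmetric window, not the authors' specific setup), co-occurrence counts can be collected as follows:

```python
from collections import Counter

def cooccurrence_counts(documents, window=5):
    """Count how often each ordered pair of words co-occurs within a
    symmetric window of `window` tokens on either side of the focus word.
    The window size of 5 is an illustrative choice."""
    counts = Counter()
    for doc in documents:
        for i, focus in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(focus, doc[j])] += 1
    return counts
```

Counts of this kind are the raw material that co-occurrence-based algorithms, including the one discussed next, build their word representations from.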
Word co-occurrence is at the heart of several machine learning algorithms, including the recently introduced word2vec by Mikolov and collaborators (Mikolov et al., 2013a; Mikolov et al., 2013b). Word2vec is a neural network with a single hidden layer that uses word co-occurrence for learning a relatively low-dimensional vector representation of each word in a corpus, a so-called distributed representation (Hinton, 1986). The di-