2.1 Distributional Semantics
Methods of distributional semantics can be classified broadly as either probabilistic or
geometric. Probabilistic models view documents as mixtures of topics, allowing terms
to be represented according to the probability of their being encountered during the
discussion of a particular topic. Geometric models, of which Random Indexing is an
exemplar, represent terms as vectors in multi-dimensional space, the dimensions of
which are derived from the distribution of terms across defined contexts, which may
include entire documents, regions within documents or grammatical relations. For
example, Latent Semantic Analysis (LSA) [18] uses the entire document as the con-
text, by generating a term-document matrix in which each cell corresponds to the
number of times a term occurs in a document. On the other hand, the Hyperspace
Analog to Language (HAL) model [20] uses the words surrounding the target term as
the context, by generating a term-term matrix to note the number of times a given
term occurs in the neighborhood of every other term. In contrast, Schütze’s
Wordspace [28] defines a sliding window of around 1000 frequently-occurring four-
grams as a context, resulting in a term-by-four-gram matrix. Typically, the magnitude of a term vector depends on how frequently the term occurs in the corpus, and its direction depends on the term's relationship with the chosen base vectors.
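As a concrete illustration of these two notions of context, the following minimal sketch (a toy corpus and plain NumPy are assumed here; it is not taken from the systems cited above) builds an LSA-style term-by-document count matrix and a HAL-style term-by-term co-occurrence matrix using a small sliding window.

import numpy as np
from collections import Counter

# Toy corpus; in practice each entry would be a full document.
docs = [
    "patients with diabetes received insulin",
    "insulin controls blood glucose in diabetes",
    "the trial measured blood glucose levels",
]
tokens = [d.split() for d in docs]
vocab = sorted({w for t in tokens for w in t})
idx = {w: i for i, w in enumerate(vocab)}

# LSA-style context: each cell counts how often a term occurs in a document.
term_doc = np.zeros((len(vocab), len(docs)))
for j, t in enumerate(tokens):
    for w, c in Counter(t).items():
        term_doc[idx[w], j] = c

# HAL-style context: each cell counts how often a term occurs within a
# two-word window of another term.
window = 2
term_term = np.zeros((len(vocab), len(vocab)))
for t in tokens:
    for i, w in enumerate(t):
        for n in t[max(0, i - window):i] + t[i + 1:i + 1 + window]:
            term_term[idx[w], idx[n]] += 1

In both cases each row is a term vector; the two matrices differ only in what counts as a context (a whole document versus a neighboring term).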
Random Indexing: Most distributional semantic models incur high computational and storage costs when the model is built or modified, because of the large number of dimensions that arise when a large corpus is modeled. While dimensional-
ity reduction techniques such as Singular Value Decomposition (SVD) are able to
generate a reduced-dimensional approximation of a term-by-context matrix, this com-
pression comes at considerable computational cost. For example, the time complexity
of SVD with standard algorithms is essentially cubic [4]. Recently, Random Indexing [14] has emerged as a promising alternative to the use of SVD for the dimension-reduction step in the generation of term-by-context vectors. Random Indexing and other similar
methods are motivated by the Johnson–Lindenstrauss Lemma [12], which states that
the distance between points in a vector space will be approximately preserved if they
are projected into a reduced-dimensional subspace of sufficient dimensionality. While
this procedure requires a fraction of the RAM and processing power of Singular
Value Decomposition, it is able to produce term–term associations [14] of similar
accuracy to those produced by SVD-based Latent Semantic Analysis.
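The distance-preservation property referred to here can be demonstrated with a short simulation. In the sketch below (the matrix sizes, sparsity, and latent structure are illustrative assumptions, not values from the cited work), a simulated term-by-context matrix is multiplied by a random matrix of sparse ternary vectors, and pairwise cosine similarities are compared before and after projection.

import numpy as np

rng = np.random.default_rng(0)

# Simulated full term-by-context matrix: 200 terms observed in 5,000 contexts,
# given some shared latent structure so that similarities are non-trivial.
n_terms, n_contexts, k = 200, 5_000, 1_000
latent = rng.normal(size=(n_terms, 20))
full = (latent @ rng.normal(size=(20, n_contexts))
        + 0.1 * rng.normal(size=(n_terms, n_contexts)))

# One sparse random vector per context: mostly zeros, with about ten
# randomly placed +1 and -1 entries.
proj = np.zeros((n_contexts, k))
for row in proj:
    pos = rng.choice(k, size=10, replace=False)
    row[pos] = rng.choice([-1.0, 1.0], size=10)

# Project every term vector into the k-dimensional space in one multiplication.
reduced = full @ proj

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pairwise similarities are approximately preserved after projection.
for i, j in [(0, 1), (2, 3), (4, 5)]:
    print(f"full: {cosine(full[i], full[j]):.3f}  "
          f"reduced: {cosine(reduced[i], reduced[j]):.3f}")

The reduced matrix has 1,000 columns rather than 5,000 (or, in a realistic corpus, hundreds of thousands), yet the relative similarities between term vectors change only slightly, in line with the Lemma's guarantee that Euclidean distances are approximately preserved under such projections.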
Random Indexing avoids the need to construct and subsequently reduce the dimen-
sions of a term-by-context matrix by generating a reduced-dimensional matrix di-
rectly. This is accomplished by assigning to each context a sparse elemental vector whose dimensionality (on the order of 1,000) is that of the reduced-dimensional space to be generated. These vectors consist mostly of zeros, but a small number (on the order of 10) of +1 and -1 values are randomly distributed across the vector.
Given the many possible ways of distributing a small number of +1 and -1 values in a high-dimensional space, it is likely that most of the assigned index vectors will be close to orthogonal (almost perpendicular) to one another. Consequently, rather than
constructing a full term-by-context matrix in which each context is represented as an
independent dimension, a reduced-dimensional matrix in which each context is repre-
sented as a close-to-independent vector is constructed. Term vectors are then