than an array of booleans.) In practice, N will be chosen to be a
power of two, allowing the modulus operation (line 9) to be
replaced by a simple bitwise AND.
If a more compact representation of the resulting feature vector is
required, a quick, linear scan of the boolean array can be used to
insert each occurring word hash into a hash table or into an array
containing the word hashes found. In many cases, though,
features are extracted in order to be used immediately for some
task, and in such cases the boolean array representation is often
sufficient.
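The registration and compaction steps described above can be sketched in Java as follows; the class and method names are illustrative, not taken from the paper's pseudo-code, and the mask trick assumes N is a power of two as suggested.

```java
// Sketch: register word hashes in a boolean array whose size is a power
// of two, so the modulus reduces to a bitwise AND, then optionally
// compact the array into a list of distinct feature ids by linear scan.
import java.util.ArrayList;
import java.util.List;

public class FeatureBits {
    static final int N = 1 << 20;        // power of two, as suggested
    static final int MASK = N - 1;       // h & MASK replaces h % N
    final boolean[] bits = new boolean[N];

    void register(int wordHash) {
        bits[wordHash & MASK] = true;    // AND also clears the sign bits
    }

    // Optional compaction: linear scan collecting the set feature ids.
    List<Integer> compact() {
        List<Integer> features = new ArrayList<>();
        for (int i = 0; i < N; i++)
            if (bits[i]) features.add(i);
        return features;
    }
}
```

The AND with `N - 1` also maps negative 32-bit hashes into range, so no separate absolute-value step is needed.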
A reasonably large bits array of size N = 2^20 is suggested. Larger
values of N result in a greater memory demand momentarily for
the bits array, whereas smaller values of N effectively restrict the
size of the hash space, conflating features that have different 32-
bit integer hash values together into a smaller space. Using N = 2^20
is easily sufficient for corpora or training sets with hundreds of
thousands of unique terms. For perspective, a corpus of 804,414
Reuters news articles contains 47,236 unique words, and a corpus
of 62,369 Arxiv.org astrophysics papers contains 99,757 unique
words [8]. Regarding the memory demands, most compilers will
allocate one byte per boolean for good performance, rather than
packing them tightly into bits. Thus, for N = 2^20 the size of this
bits table is 1 MB, which is a fraction of the L2 cache for modern
CPU chips. If one instead needs to track word counts for the
bag-of-words representation, the same array can be used to hold
8-bit or 16-bit word frequency counters.
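The counter variant can be sketched as below; the saturating increment is an assumption added here to guard against overflow of the 16-bit counters, not a detail from the paper.

```java
// Sketch: tracking bag-of-words counts with 16-bit counters instead of
// booleans. For N = 2^20 this array occupies 2 MB. The saturation at
// Short.MAX_VALUE is an illustrative overflow guard.
public class FeatureCounts {
    static final int N = 1 << 20;
    static final int MASK = N - 1;
    final short[] counts = new short[N];

    void register(int wordHash) {
        int i = wordHash & MASK;
        if (counts[i] != Short.MAX_VALUE) counts[i]++;  // saturating add
    }
}
```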
The pre-computed code table actually serves four different
functions simultaneously: word character determination,
lowercasing, hash randomization, and, as will be introduced later
in Section 5.2, Unicode processing. Table 3 shows pseudo-code
to prepare the table appropriate for 8-bit character streams. For
any character that is not a word character, the table contains zero.
For word characters, it maps the character to its lowercase
equivalent and maps the result to a distinct fixed random integer
code. This randomization substantially improves the hashing
function over just using the lowercase character code directly, as
the Java String.hashCode() method does. (For reference, Java
Strings use multiplicative hashing [10], computing their hashes by
iterating hash = 31*hash + char over their characters).
It should be noted that while in our tests our word characters were
letters and numbers and our normalization is lowercasing, this is all
embodied in the creation of the code table. Any other set of rules
could be used by providing a different table. The only restriction
is that the mapped and normalized character (if any) be decidable
from a single character. This can be relaxed, but the mechanisms
for doing so are beyond the scope of this paper.
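A minimal Java rendering of Table 3 might look like the following; the use of `java.util.Random` with a fixed seed and of `Character.isLetterOrDigit` as the word-character test are assumptions for illustration, since the paper only requires some fixed random integers and some word-character rule.

```java
// Sketch of Table 3's prepTable() for 8-bit characters. The seed is
// arbitrary but must stay constant so that indexing and later querying
// agree on the same codes.
import java.util.Random;

public class CodeTable {
    final int[] codetable = new int[256];

    CodeTable() {
        Random rng = new Random(42);               // fixed seed -> fixed codes
        int[] rand = new int[256];                 // 256 fixed random integers
        for (int i = 0; i < 256; i++) rand[i] = rng.nextInt();
        for (int ch = 0; ch < 256; ch++) {
            codetable[ch] = Character.isLetterOrDigit(ch)
                    ? rand[Character.toLowerCase(ch)]  // lowercase, then randomize
                    : 0;                               // non-word character
        }
    }
}
```

Because upper- and lowercase forms index the same `rand` entry, case folding falls out of the table lookup for free.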
Our hashing function, which we call Mapped Additive Shift
Hashing (MASH), consists of a simple arithmetic right shift, a
table lookup (from a very small table), and an addition. As such,
it is very efficient and easy for a compiler to optimize. Although
the rightward direction of the bit-shift operation (line 6) is
somewhat non-intuitive, it actually produces substantially better
hashes than using a left-shift. The reason is that we end up using
only the low order bits of the resulting word hash (line 9), and if a
left-shift were used instead, then these lowest order bits would be
largely unaffected by the first characters of the word, especially
for a longer word. (In the case in which the modulus is a power of
two for efficiency, they would be completely unaffected by
them.) If the rightmost ten bits are kept, then only the final ten
characters can contribute to the hash. By using a right-shift, on
the other hand, and by ensuring that each character contributes 32
bits to the value, a character’s influence continues (in a degrading
fashion) for at least 31 following characters. Even for the
occasional word longer than 32 characters, note that we use a
sign-extended right-shift, so that the parity of initial letters
continues to have some effect on the hash. Of course, this
influence does diminish, and so MASH is certainly not suitable
for hashing long texts, which is the goal of typical hash function
designs such as SHA1, MD5 and Rabin hashing, which are each
much more involved to compute.
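The per-character step can be sketched as follows. The one-bit shift is inferred from the text's claim that a character's influence persists for at least 31 following characters; the loop structure around it (detecting word boundaries where the code table yields zero) is a reconstruction from the description rather than a copy of Table 2.

```java
// Sketch of the MASH inner loop: a sign-extended arithmetic right shift,
// a table lookup, and an addition per character. A zero code marks a
// non-word character, i.e. a word boundary.
public class Mash {
    static final int[] codetable = new int[256]; // prepared as in Table 3

    static void extract(byte[] text, boolean[] bits, int mask) {
        int hash = 0;
        for (byte b : text) {
            int code = codetable[b & 0xFF];      // 0xFF handles bytes >= 128
            if (code != 0) {
                hash = (hash >> 1) + code;       // sign-extended right shift
            } else if (hash != 0) {
                bits[hash & mask] = true;        // register completed word
                hash = 0;
            }
        }
        if (hash != 0) bits[hash & mask] = true; // register final word
    }
}
```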
If one wishes to also produce features for each phrase bi-gram,
then at the moment when the hash is registered (line 9), we can
also register a phrase hash consisting of the current hash
combined with the previous word hash, using a symmetric
combination (such as XOR or addition) if it is desired to have
features that are independent of the order of the words or an
asymmetric combination (such as subtraction or XOR with a
shifted value) if order is to be preserved. Working in integer
space, such combination is much more efficient than working in
string space, in which typically a new string must be constructed
(or at least a new heap-based object that refers to the two strings).
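The two combination styles can be sketched as pure integer functions; the shift amount in the ordered variant is an arbitrary illustrative choice.

```java
// Sketch of phrase bi-gram hash combination in integer space. XOR is
// symmetric, so word order does not matter; XOR with a shifted copy of
// the previous hash is asymmetric, so order is preserved.
public class Bigrams {
    static int unordered(int prevHash, int hash) {
        return prevHash ^ hash;            // symmetric: order-independent
    }

    static int ordered(int prevHash, int hash) {
        return (prevHash << 5) ^ hash;     // asymmetric: order-sensitive
    }
}
```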
2.3 Use in Full-Text Indexing
So far the discussion has primarily focused on the use case of
extracting a bag or set of words, which is needed for use in
classification or clustering applications. But the speed benefits of
our method can also be leveraged for the widespread use case of
full-text indexing of large document repositories (either large
documents or many of them or both). In this case, the indexing
engine needs to extract the sequence of words/terms in a text, and
build reverse indices of them for later information retrieval. The
key difference for our features would be that the term features are
integers rather than Strings. When later a query is submitted, the
words that the user enters are submitted to the exact same hash
processing, yielding the same integer hashes that the text has
previously been indexed by. Note that for this usage, one can use
the full 32-bit integer hash space (i.e. without the modulo N and
the bit array on lines 1 and 9 of Table 2).
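The indexing use case can be sketched as below; the posting-list structure is illustrative only, standing in for whatever inverted-index machinery an engine actually uses, with the sole point being that keys are 32-bit integers rather than Strings.

```java
// Sketch of an inverted index keyed by the full 32-bit word hash. A
// query word is run through the same hash processing at lookup time,
// so it retrieves the postings recorded at indexing time.
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashIndex {
    final Map<Integer, List<Integer>> postings = new HashMap<>();

    void index(int docId, int wordHash) {
        postings.computeIfAbsent(wordHash, k -> new ArrayList<>()).add(docId);
    }

    List<Integer> lookup(int queryWordHash) {
        return postings.getOrDefault(queryWordHash, Collections.emptyList());
    }
}
```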
3. EXPERIMENTS
In this section we present a series of experiments to evaluate the
speed and other qualities of our method compared with various
baselines. In all cases our compiler is Java JDK 1.6. Where
speed measurements are concerned, the measurements were
performed on a 2.4 GHz Intel® Xeon™ CPU, which is a 64-bit
architecture, running a version of Linux derived from Red Hat
version 4. Measuring elapsed time for performance can be quite
tricky, since other processes and threads regularly interrupt
processing for uncontrolled amounts of time, occasionally
slowing down the overall or average measurement substantially.
For this reason, we measure the throughput for each test over 50
separate iterations and use the measurement that shows the best
throughput.

Table 3. Code table pre-compilation for SpeedyFx.
prepTable():
1  int rand[256] = 256 fixed random integers;
2  foreach (ch: 0..255) {
3    codetable[ch] = isWord(ch) ?
4      rand[toLowerCase(ch)] : 0;
5  }
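The best-of-N measurement scheme can be sketched as a small harness; the method name and the bytes-per-second unit are illustrative choices, not details from the paper.

```java
// Sketch of best-of-N timing: repeat the workload and keep the
// iteration with the highest throughput, discarding iterations
// perturbed by OS scheduling or other processes.
public class BestOfTimer {
    static double bestThroughput(Runnable work, long bytesPerRun, int iters) {
        double best = 0;
        for (int i = 0; i < iters; i++) {
            long t0 = System.nanoTime();
            work.run();
            long dt = System.nanoTime() - t0;
            double throughput = bytesPerRun / (dt / 1e9);  // bytes per second
            if (throughput > best) best = throughput;
        }
        return best;
    }
}
```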