to the query. Once all query terms have been processed, similarity scores $S_d$ are calculated by dividing each accumulator value by the corresponding value of $W_d$. Finally, the $r$ largest similarities are identified, and the corresponding documents returned to the user.
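To make this process concrete, here is a minimal Python sketch of term-at-a-time evaluation with accumulators. It assumes in-memory postings and an IDF-style term weight; the function and variable names (rank, doc_weight, and so on) are illustrative, not taken from the paper.

    import heapq
    import math
    from collections import defaultdict

    def rank(query_terms, index, doc_weight, num_docs, r):
        """Term-at-a-time ranking with accumulators (a sketch).

        index      : term t -> list of (d, f_dt) postings
        doc_weight : document d -> weight W_d
        num_docs   : total number of documents N
        r          : number of answers to return
        """
        acc = defaultdict(float)                 # one accumulator per document
        for t in query_terms:
            postings = index.get(t, [])
            if not postings:
                continue
            w_t = math.log(1 + num_docs / len(postings))  # assumed IDF-style weight
            for d, f_dt in postings:
                acc[d] += w_t * f_dt             # contribution of t to document d
        # S_d = accumulator value / W_d; return the r largest similarities.
        return heapq.nlargest(r, ((a / doc_weight[d], d) for d, a in acc.items()))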
The cost of ranking via an index is far less than with the exhaustive algorithm
outlined in Figure 2. Given a query of three terms, processing a query against the Web
data involves finding the three terms in the vocabulary; fetching and then processing
three inverted lists of perhaps 100 kB to 1 MB each; and making two linear passes over
an array of 12,000,000 accumulators. The complete sequence requires well under a
second on current desktop machines.
Nonetheless, the costs are still significant. Disk space is required for the index at
20%–60% of the size of the data for an index of the type shown in Figure 3; memory is
required for an accumulator for each document and for some or all of the vocabulary;
CPU time is required for processing inverted lists and accumulators; and disk traffic
is used to fetch inverted lists. Fortunately, compared to the implementation shown in
Figure 4, all of these costs can be dramatically reduced.
Indexing Word Positions.
We have described inverted lists as sequences of index entries, each a $\langle d, f_{d,t}\rangle$ pair. An index of this form is document-level, since it indicates whether a term occurs in a document but does not contain information about precisely where the term appears. Given that the frequency $f_{d,t}$ represents the number of occurrences of $t$ in $d$, it is straightforward to modify each entry to include the $f_{d,t}$ ordinal word positions $p$ at which $t$ occurs in $d$ and create a word-level inverted list containing pointers of the form $\langle d, f_{d,t}, p_1, \ldots, p_{f_{d,t}}\rangle$. Note that in this representation positions are word counts, not byte counts, so that they can be used to determine adjacency.
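A word-level index of this kind can be sketched as follows. The posting format matches the $\langle d, f_{d,t}, p_1, \ldots, p_{f_{d,t}}\rangle$ entries just described, while the tokenization (simple whitespace splitting) is a simplifying assumption.

    from collections import defaultdict

    def build_word_level_index(docs):
        """Build postings of the form (d, f_dt, [p_1, ..., p_fdt]) (a sketch).

        docs: iterable of (d, text) pairs, documents numbered from one.
        Positions are ordinal word counts (1-based), not byte offsets.
        """
        index = defaultdict(list)
        for d, text in docs:
            positions = defaultdict(list)
            for p, term in enumerate(text.split(), start=1):
                positions[term].append(p)
            for term in sorted(positions):
                plist = positions[term]
                index[term].append((d, len(plist), plist))
        return index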
Word positions can be used in a variety of ways during query evaluation. Section 4 discusses one of these, phrase queries, in which the user requests documents containing a sequence of terms rather than a bag of words. Word positions can also be used in bag-of-words queries, for example, to prefer documents in which the query terms are close together or appear near the beginning of the document. Similarity measures that make use of such proximity mechanisms have not been particularly successful in experimental settings but, for simple queries, adjacency and proximity do appear to be of value in Web retrieval.
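For instance, adjacency testing reduces to arithmetic on positions. The sketch below intersects two word-level inverted lists to find documents in which the second term immediately follows the first; it is a two-term special case of phrase evaluation, which Section 4 treats in full.

    def adjacent_pairs(list1, list2):
        """Documents where a term2 occurrence directly follows term1 (a sketch).

        Each argument is a word-level inverted list of (d, f_dt, positions)
        entries. Because positions are word counts, adjacency is simply a
        position difference of one.
        """
        second = {d: set(ps) for d, _, ps in list2}
        result = []
        for d, _, ps in list1:
            if d in second:
                starts = [p for p in ps if p + 1 in second[d]]
                if starts:
                    result.append((d, len(starts), starts))
        return result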
If the source document has a hierarchical structure, that structure can be reflected
by a similar hierarchy in the inverted index. For example, a document with a structure
of chapters, sections, and paragraphs might have word locations stored as (c, s, p, w)
tuples coded as a sequence of nested runs of c-gaps, s-gaps, p-gaps, and w-gaps. Such
an index allows within-same-paragraph queries as well as phrase queries, for example,
and with an appropriate representation, is only slightly more expensive to store than
a nonhierarchical index.
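The nested gap coding can be illustrated as follows. The convention used here, in which the first nonzero gap at a level restarts every lower level from zero, is one plausible realization; the text does not fix the details.

    def encode_gaps(locations):
        """Gap-encode sorted (c, s, p, w) word locations (a sketch).

        Each component is stored as a difference from the previous tuple;
        the first nonzero gap restarts every lower level from zero, so the
        stored values stay small and compress well.
        """
        encoded, prev = [], (0, 0, 0, 0)
        for loc in locations:
            gaps, restarted = [], False
            for cur, pre in zip(loc, prev):
                if restarted:
                    gaps.append(cur)          # lower level restarts from zero
                else:
                    g = cur - pre
                    gaps.append(g)
                    restarted = g != 0
            encoded.append(tuple(gaps))
            prev = loc
        return encoded

    def decode_gaps(encoded):
        """Invert encode_gaps, recovering absolute (c, s, p, w) tuples."""
        locations, prev = [], (0, 0, 0, 0)
        for gaps in encoded:
            loc, restarted = [], False
            for g, pre in zip(gaps, prev):
                if restarted:
                    loc.append(g)
                else:
                    loc.append(pre + g)
                    restarted = g != 0
            prev = tuple(loc)
            locations.append(prev)
        return locations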
Core Ideas. To end this section, we state several key implementation decisions; a toy realization follows the list.
—Documents have ordinal identifiers, numbered from one.
—Inverted lists are stored contiguously.
—The vocabulary consists of every term occurring in the documents and is stored in a
simple extensible structure such as a B-tree.
—An inverted list consists of a sequence of pairs of document numbers and in-document
frequencies, potentially augmented by word positions.
—The vocabulary may be preprocessed by stemming and stopping.
—Ranking involves a set of accumulators and term-by-term processing of inverted lists.
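A toy realization of these decisions might look as follows, with an in-memory dictionary standing in for the B-tree vocabulary; the file layout and fixed field widths are assumptions, and a practical index would compress its lists.

    import struct

    def write_index(index, path):
        """Write inverted lists contiguously; return the vocabulary (a sketch).

        index: term -> list of (d, f_dt) pairs, documents numbered from one.
        The vocabulary maps each term to the (offset, length) of its list
        in the postings file; a real system would keep it in a B-tree.
        """
        vocab = {}
        with open(path, "wb") as f:
            for term in sorted(index):
                offset = f.tell()
                for d, f_dt in index[term]:
                    f.write(struct.pack("<II", d, f_dt))  # fixed-width, uncompressed
                vocab[term] = (offset, f.tell() - offset)
        return vocab

    def read_postings(term, vocab, path):
        """Fetch and decode one term's inverted list."""
        offset, length = vocab[term]
        with open(path, "rb") as f:
            f.seek(offset)
            data = f.read(length)
        return [struct.unpack_from("<II", data, i) for i in range(0, length, 8)]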