
Performance of Compressed Inverted List Caching
in Search Engines∗
Jiangong Zhang
CIS Department
Polytechnic University
Brooklyn, NY 11201, USA
zjg@cis.poly.edu
Xiaohui Long
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
xiaohui.long@microsoft.com
Torsten Suel
CIS Department
Polytechnic University
Brooklyn, NY 11201, USA
suel@poly.edu
ABSTRACT
Due to the rapid growth in the size of the web, web search engines
are facing enormous performance challenges. The larger engines in
particular have to be able to process tens of thousands of queries per
second on tens of billions of documents, making query throughput
a critical issue. To satisfy this heavy workload, search engines use a
variety of performance optimizations including index compression,
caching, and early termination.
We focus on two techniques, inverted index compression and index
caching, which play a crucial role in web search engines as
well as other high-performance information retrieval systems. We
perform a comparison and evaluation of several inverted list com-
pression algorithms, including new variants of existing algorithms
that have not been studied before. We then evaluate different in-
verted list caching policies on large query traces, and finally study
the possible performance benefits of combining compression and
caching. The overall goal of this paper is to provide an updated dis-
cussion and evaluation of these two techniques, and to show how to
select the best set of approaches and settings depending on parameters
such as disk speed and main memory cache size.
Categories and Subject Descriptors
H.3.1 [Information Systems]: Content Analysis and Indexing—
Indexing methods; H.3.3 [Information Systems]: Information Search
and Retrieval—Search process.
General Terms
Performance, Experimentation.
Keywords
Search engines, inverted index, index compression, index caching.
1. INTRODUCTION
Web search engines are probably the most popular tools for lo-
cating information on the world-wide web. However, due to the
rapid growth of the web and the number of users, search engines
are faced with formidable performance challenges. On one hand,
search engines have to integrate more and more advanced tech-
niques for tasks such as high-quality ranking, personalization, and
spam detection. On the other hand, they have to be able to process
tens of thousands of queries per second on tens of billions of pages;
thus query throughput is a very critical issue.
In this paper, we focus on the query throughput challenge. To
guarantee throughput and fast response times, current large search
engines are based on large clusters of hundreds or thousands of
∗Research supported by NSF ITR Award CNS-0325777. Work by the second author
was performed while he was a PhD student at Polytechnic University.
Copyright is held by the International World Wide Web Conference Com-
mittee (IW3C2). Distribution of these papers is limited to classroom use,
and personal use by others.
WWW 2008, April 21–25, 2008, Beijing, China.
ACM 978-1-60558-085-2/08/04.
servers, where each server is responsible for searching a subset of
the web pages, say a few million to hundreds of millions of pages.
This architecture successfully distributes the workload over many
servers. Thus, to maximize overall throughput, we need to maxi-
mize throughput on a single node, still a formidable challenge given
the data size per node. Current search engines use several tech-
niques such as index compression, index caching, result caching,
and query pruning (early termination) to address this issue.
We consider two important techniques that have been previously
studied in web search engines and other IR systems, inverted index
compression and inverted index caching. Our goal is to provide an
evaluation of state-of-the-art implementations of these techniques,
and to study how to combine these techniques for best overall per-
formance on current hardware. To do this, we created highly op-
timized implementations of existing fast index compression algo-
rithms, including several new variations of such algorithms, and
evaluated these on large web page collections and real search en-
gine traces. We also implemented and evaluated various caching
schemes for inverted index data, and studied the performance gains
of combining compression and caching depending on disk transfer
rate, cache size, and processor speed. We believe that this provides
an interesting and up-to-date picture of these techniques that can
inform both system developers and researchers interested in query
processing performance issues.
2. TECHNICAL BACKGROUND
Web search engines as well as many other IR systems are based
on an inverted index, which is a simple and efficient data structure
that allows us to find all documents that contain a particular term.
Given a collection of N documents, we assume that each docu-
ment is identified by a unique document ID (docID) between 0 and
N − 1. An inverted index consists of many inverted lists, where
each inverted list I_w is a list of postings describing all places where
term w occurs in the collection. More precisely, each posting contains
the docID of a document that contains the term w, the number of
occurrences of w in the document (called the frequency), and sometimes
also the exact locations of these occurrences in the document (called
positions), plus maybe other context such as the font size of the term.
The postings in an inverted list are typically sorted by docID, or
sometimes by some other measure.
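The structure just described can be sketched in a few lines of Python. This is a toy illustration of case (i) below, not code from the paper; all names (`build_index`, the sample documents) are ours:

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Build an inverted index mapping each term w to its inverted list
    I_w of (docID, frequency) postings, sorted by docID."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):   # docIDs run from 0 to N-1
        for term, freq in Counter(text.split()).items():
            # docs are scanned in docID order, so each list stays sorted
            index[term].append((doc_id, freq))
    return index

docs = ["the cat sat", "the dog chased the cat", "dog dog dog"]
index = build_index(docs)
# I_"dog" has postings for documents 1 and 2:
# index["dog"] == [(1, 1), (2, 3)]
```

A real engine would of course store these lists compressed on disk rather than as Python tuples, which is exactly the setting the rest of the paper studies.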
We consider two cases: (i) the case where we have docIDs and
frequencies, i.e., each posting is of the form (d_i, f_i), and (ii) the
case where we also store positions, i.e., each posting is of the form
(d_i, f_i, p_{i,0}, ..., p_{i,f_i−1}). We use word-oriented positions, i.e.,
p_{i,j} = k if the corresponding occurrence is the k-th word in the
document. For many well-known ranking functions, it suffices to store
only docIDs and frequencies, while in other cases positions are needed.
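Case (ii) can be sketched analogously; again this is an illustrative toy (our names, 0-based positions assumed), where each posting carries the word-oriented positions p_{i,0}, ..., p_{i,f_i−1} of its occurrences:

```python
from collections import defaultdict

def build_positional_index(docs):
    """Inverted index with postings of the form (d_i, f_i, positions),
    where positions[j] = k means the j-th occurrence of the term is the
    k-th word of the document (counting from 0)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        occurrences = defaultdict(list)
        for k, word in enumerate(text.split()):
            occurrences[word].append(k)   # word-oriented position
        for term, pos in occurrences.items():
            index[term].append((doc_id, len(pos), pos))
    return index

docs = ["the dog chased the cat"]
idx = build_positional_index(docs)
# "the" occurs twice, as the 0th and 3rd word of document 0:
# idx["the"] == [(0, 2, [0, 3])]
```

Storing positions roughly triples the posting size in this uncompressed form, which is why the choice between cases (i) and (ii) matters for both compression and caching.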
To a first approximation, search engines process a search query
“dog, cat” by fetching and traversing the inverted lists for “dog”
and “cat”. During this traversal, they intersect (or merge) the post-
ings from the two lists in order to find documents containing all (or
at least one) of the query terms. At the same time, the engine also
WWW 2008 / Refereed Track: Search - Corpus Characterization & Search Performance Beijing, China