A Three Level Search Engine Index
Based in Query Log Distribution
Ricardo Baeza-Yates and Felipe Saint-Jean
Center for Web Research
Department of Computer Science
Universidad de Chile
Blanco Encalada 2120, Santiago, Chile
{rbaeza,fsaint}@dcc.uchile.cl
Abstract. Queries to a search engine follow a power-law distribution, which is
far from uniform. Hence, it is natural to adapt a search engine index to the query
distribution. In this paper we present a three level memory organization for a
search engine inverted file index that includes main and secondary memory, as
well as precomputed answers, such that the use of main memory and the answer
time are significantly improved. We include experimental results as well as an
analytical model.
1 Introduction
Given the rate of growth of the Web, scalability of search engines is a key issue, as the
amount of hardware and network resources needed is large, and expensive. In addition,
search engines are popular tools, so they have heavy constraints on query answer time.
So the efficient use of resources can improve both scalability and answer time.
The query distribution in a search engine follows a very biased distribution, namely
a power or Zipf’s law, which allows to organize a search engine index such that memory
is used well and answer time is improved. For this, we just have to analyze the search
engine query log. For example, we can leave the part of the index that is really queried
in main memory and the rest in secondary memory. This is surely done by all search
engines. In addition, very common queries can be precomputed, such that the answer
time is faster.
In this paper we present an inverted file organization that has three levels: precom-
puted answers, main and secondary memory indexes. Our analytical model is based on
real search engine data which also shows the improvements obtained. We show for ex-
ample, that by using half the index in main memory we can answer 80% of the queries,
and that using a small number of precomputed answers we can improve the query an-
swer time on at least 7%. Part of our analysis shows that there is almost no correlation
between query word frequency and Web page word frequency, at least in our context.
This implies that what people search is different from what people write.
There are few papers that deal with the use of query logs to improve search engines,
because this information is usually not disclosed. The exceptions deal with caching the
We wish to thank the helpful comments of the reviewers.
M.A. Nascimento, E.S. de Moura, A.L. Oliveira (Eds.): SPIRE 2003, LNCS 2857, pp. 56–65, 2003.
c
Springer-Verlag Berlin Heidelberg 2003