Locality in Search Engine Queries and Its
Implications for Caching
Yinglian Xie and David O’Hallaron
Department of Computer Science, Carnegie Mellon University
Email: {ylxie, droh}@cs.cmu.edu
Abstract—Caching is a popular technique for reducing both server load
and user response time in distributed systems. In this paper, we consider the
question of whether caching might be effective for search engines as well.
We study two real search engine traces by examining query locality and its
implications for caching. Our trace analysis results show that: (1) Queries
have significant locality, with query frequency following a Zipf distribution.
Very popular queries are shared among different users and can be cached
at servers or proxies, while 16% to 22% of the queries are from the same
users and should be cached at the user side. Multiple-word queries are
shared less and should be cached mainly at the user side. (2) If caching is
to be done at the user side, short-term caching for hours will be enough to
cover query temporal locality, while server/proxy caching should use longer
periods, such as days. (3) Most users have small lexicons when submitting
queries. Frequent users who submit many search requests tend to reuse
a small subset of words to form queries. Thus, with proxy or user side
caching, prefetching based on user lexicon looks promising.
I. INTRODUCTION
Caching is an important technique to reduce server workload
and user response time. For example, clients can send
requests to proxies, which then respond using locally cached
data. By caching frequently accessed objects in the proxy cache,
the transmission delays of these objects are minimized because
they are served from nearby caches instead of remote servers. In
addition, by absorbing a portion of the workload, proxy caches
can increase the capacity of both servers and networks, thereby
enabling them to service a potentially larger clientele.
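The proxy-cache pattern described here can be sketched as a small LRU cache keyed on query strings. This is an illustrative sketch, not the paper's implementation; the capacity and the `fetch` callback (which stands in for forwarding a request to the remote server) are hypothetical.

```python
from collections import OrderedDict

class ProxyCache:
    """A minimal LRU cache for search results, keyed by query string."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity    # maximum number of cached result sets
        self.fetch = fetch          # fallback: forward the query to the server
        self.cache = OrderedDict()  # query -> results, oldest entry first
        self.hits = self.misses = 0

    def lookup(self, query):
        if query in self.cache:
            self.cache.move_to_end(query)    # mark as most recently used
            self.hits += 1
            return self.cache[query]
        self.misses += 1
        results = self.fetch(query)          # miss: contact the remote server
        self.cache[query] = results
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used
        return results
```

Repeated queries are then served from the nearby cache, and only misses incur the transmission delay and server computation described above.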
We are interested in the question of whether caching might be
effective for search engines as well. Because serving a search
request requires a significant amount of computation as well as
I/O and network bandwidth, caching search results could im-
prove performance in three ways. First, results for repeated
queries are returned without redundant processing, minimizing
access latency. Second, because of the reduction in server work-
load, scarce computing cycles in the server are saved, allowing
these cycles to be applied to more advanced algorithms and po-
tentially better results. Finally, by disseminating user requests
among proxy caches, we can distribute part of the computational
tasks and customize search results based on user contextual in-
formation.
Although Web caching has been widely studied, few re-
searchers have tackled the problem of caching search engine re-
sults. While it is already known that search engine queries have
significant locality, several important questions are still open:
• Where should we cache search engine results? Should we
cache them at the server’s machine, at the user’s machine, or
in intermediate proxies? To determine which type of caching
would result in the best hit rates, we need to look at the degree
of query popularity at each level and whether queries will be
shared among different users.
• How long should we keep a query in cache before it becomes
stale?
• What other benefits might accrue from caching? Since both
proxy and client side caching are distributed ways of serving
search requests, can we prefetch or re-rank query results based
on individual user requirements?
To answer these questions, we study two real search engine
traces and investigate their implications for caching search
engine results. Our analysis yielded the following key results:
• Queries have significant locality. About 30% to 40% of
queries are repeats of queries submitted before. Query repetition
frequency follows a Zipf distribution. Popular queries with high
repetition frequencies are shared among different users and can
be cached at servers or proxies. Queries are also frequently
repeated by the same users: about 16% to 22% of all queries are
repeated queries from the same user, and these should be cached
at the user side. Multiple-word queries are less likely to be
shared by different users, so they too should be cached mainly
at the user side.
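A Zipf distribution means the frequency of the i-th most popular query is roughly proportional to 1/i, so log frequency versus log rank is close to a straight line with slope near -1. The sketch below, on synthetic data rather than the paper's traces, estimates that slope by ordinary least squares:

```python
import math

def zipf_slope(frequencies):
    """Estimate the Zipf exponent: the least-squares slope of
    log(frequency) against log(rank) over the ranked frequency list."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Synthetic example: query at rank r repeated ~1000/r times
freqs = [round(1000 / r) for r in range(1, 101)]
slope = zipf_slope(freqs)   # close to -1 for ideal 1/rank frequencies
```

On real trace data, a slope near -1 over the popular ranks is what makes server/proxy caching of the head of the distribution attractive: a small cache covers a large fraction of requests.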
• The majority of repeated queries are referenced again within
short time intervals, but a significant portion are repeated over
relatively longer intervals, and these are largely shared by
different users. So if caching is to be done at the user side,
short-term caching for hours will be enough to cover query
temporal locality, while server/proxy caching should use longer
periods, on the order of days.
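The hours-versus-days policy above amounts to a time-to-live (TTL) cache whose expiration differs by tier. A minimal sketch, with hypothetical TTL values and an injectable clock (defaulting to real time) so the expiry logic can be exercised deterministically:

```python
import time

class TTLCache:
    """Cache whose entries expire after `ttl` seconds: hours at the
    user side, days at a server/proxy, per the locality results."""

    def __init__(self, ttl, now=time.time):
        self.ttl = ttl
        self.now = now     # injectable clock; real wall time by default
        self.store = {}    # query -> (results, insertion time)

    def put(self, query, results):
        self.store[query] = (results, self.now())

    def get(self, query):
        entry = self.store.get(query)
        if entry is None:
            return None
        results, t = entry
        if self.now() - t > self.ttl:
            del self.store[query]   # stale entry: drop and report a miss
            return None
        return results

HOURS, DAYS = 3600, 86400
user_cache = TTLCache(ttl=4 * HOURS)   # short-term, user side (example value)
proxy_cache = TTLCache(ttl=3 * DAYS)   # longer-term, proxy/server (example value)
```

The specific TTLs here are placeholders; the trace analysis only motivates the order of magnitude at each tier, not particular constants.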
• Most users have small lexicons when submitting queries. Fre-
quent users who submit many search requests tend to reuse a
small subset of words to form queries. Thus, with proxy or user
side caching, prefetching based on user lexicons is promising.
Proxy or user side caching also provides opportunities to
improve query results based on individual user preferences,
which is an important future research direction.
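One way to act on the lexicon observation is to track each user's word frequencies and prefetch results for queries built from the words that user reuses. The sketch below is a hypothetical design, not the paper's; the reuse threshold and candidate cap are arbitrary choices:

```python
from collections import Counter
from itertools import combinations

class UserLexicon:
    """Track the words one user submits and propose prefetch candidates
    formed from the words that user reuses most often."""

    def __init__(self):
        self.word_counts = Counter()

    def record(self, query):
        self.word_counts.update(query.lower().split())

    def prefetch_candidates(self, min_count=2, max_candidates=10):
        """Single words and word pairs over frequently reused words."""
        frequent = [w for w, c in self.word_counts.most_common()
                    if c >= min_count]
        pairs = [" ".join(p) for p in combinations(frequent, 2)]
        return (frequent + pairs)[:max_candidates]  # cap is arbitrary
```

A proxy or client cache could warm itself with results for these candidates during idle periods, trading some extra server load for lower latency on the user's likely next queries.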
In the rest of the paper, we first discuss related work in
Section I-A. We then describe the traces we analyzed and
summarize the general statistics of the data in Section II. In Section
III, we focus on repeated queries and discuss query locality in
both traces. Section IV presents our findings about user lexicon
analysis and its implications. Finally, we review analysis results
and discuss possible future research directions.
A. Related Work
Due to the exponential growth of the Web, there has been
much research on the impact of Web caching and how to max-
imize its performance benefits. Most Web browsers support
caching documents in the client’s memory or local disk to re-
0-7803-7476-2/02/$17.00 (c) 2002 IEEE. 1238 IEEE INFOCOM 2002