frequency of query q within the t-th time interval (the search volume within interval t divided by the total search volume over all time), t = 1, …, N, where "month" is the interval used here and N is the number of time intervals (months). The time series F(q) of a query q over N time intervals is the sequence of f_t(q), denoted as

$F(q) = \{f_1(q), f_2(q), \ldots, f_t(q), \ldots, f_N(q)\}$  (1)
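As a concrete sketch, f_t(q) can be computed from a query log as the monthly search volume of q divided by its all-time volume. The log format below, a list of (query, month) pairs, is an assumption for illustration only and is not specified in the article:

```python
from collections import Counter

def build_time_series(log_records, query, months):
    """Build F(q): for each interval t, the search volume of `query`
    within t divided by its total search volume over all time.
    `log_records` is an assumed format: a list of (query, month) pairs;
    `months` is the ordered list of N time intervals."""
    per_month = Counter(m for q, m in log_records if q == query)
    total = sum(per_month.values())
    if total == 0:
        return [0.0] * len(months)  # no search records for this query
    return [per_month[m] / total for m in months]
```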
For most queries, we can determine F(q) by mining the
query logs. However, there are some queries, especially long
queries, for which we cannot determine F(q) due to the
insufficient search volume of the query logs. For example, if
we submit “a Canadian publishing house book” to Google
Trends,² it returns "Not enough search volume to show graphs." Thus, for these queries, we investigate how to estimate F(q) without the use of query logs.
We present two estimation approaches based on the fol-
lowing phenomenon. There are basically two roles on the
Internet: information publishers and information hunters.
Intuitively, they act consistently over time. In other words,
when an event occurs, information publishers publish the
related news articles, and information hunters simultane-
ously submit queries to search them. The number of relevant
articles reflects the popularity of corresponding queries, and
vice versa (Adar, Teevan, Dumais, & Elsas, 2009; Dakka,
Gravano, & Ipeirotis, 2012). We expect to construct the time
series F(q) from the document corpus when a query's search
records in query logs are not sufficient to build the curves.
Two approaches, document-level approximation (DLA) and word-level approximation (WLA), are proposed to solve this problem at different levels.
DLA
We assume that changes in a query's search frequency are reflected by the number of relevant documents published over time. Therefore, for a given query q, we use its relevant documents over time to approximate its time series curve, as follows:
$f_t(q) \approx \frac{\sum_{d \in D_k} P(t \mid d)\, P(d \mid q)}{\sum_{d \in D_k} P(d \mid q)}$  (2)
where t represents the time interval within which a docu-
ment is published. P(t|d) equals 1 if the publication time
(Chen, Ma, Cui, Rui, & Huang, 2010) of document d is
within t or 0 otherwise. In this article, D_k denotes the top
k = 100 Google search results of query q. The publication
time of a document is obtained by setting time constraints in
“Google Search Tools” when crawling Google search
results. P(d|q) is the relevance of document d to query q,
which is estimated with the Query Likelihood Model (Ponte
& Croft, 1998). If more relevant documents for query q are
published within the time interval t, Equation (2) will have a
larger value at t.
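Equation (2) can be sketched as a relevance-weighted histogram over publication months. In the snippet below, the (publication_month, relevance) pairs stand in for P(t|d) and the Query Likelihood estimate of P(d|q); this data layout is an assumption for illustration:

```python
def dla_estimate(docs, months):
    """Approximate f_t(q) via Eq. (2).
    `docs` is an assumed format: a list of (publication_month, relevance)
    pairs for the top-k search results of q, where relevance plays the
    role of P(d|q). P(t|d) is 1 when d was published in month t, else 0."""
    denom = sum(rel for _, rel in docs)  # sum of P(d|q) over D_k
    series = []
    for t in months:
        # numerator: sum of P(t|d) * P(d|q), i.e. relevance mass in month t
        num = sum(rel for m, rel in docs if m == t)
        series.append(num / denom if denom else 0.0)
    return series
```

Months in which more relevant documents were published receive proportionally larger values, mirroring the behavior described for Equation (2).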
WLA
We also assume that a query’s search frequencies over
time can be reflected by the occurrence of its words. Thus,
for a given query q, we use its word-occurrence distribution
over time to approximate its time series curve:
$f_t(q) \approx \prod_{w \in q} \frac{P(w \mid t)\, P(w \mid q)}{DF(w)}$  (3)
P(w|t) is the probability of generating the query word w within the time interval t and is estimated as the term frequency (TF) of w within t divided by the TF of all words within t in the document corpus, with Laplace smoothing. DF(w) is the document frequency (DF) of word w. From the perspective of language models for information retrieval, P(w|q) is the query language model. However, most user queries are short, so P(w|q) is usually estimated with some interpolation approaches (Manning, Raghavan, & Schütze, 2008). In this article, we use the TF of word w divided by the TF of all words of query q in the top k Google search results to estimate P(w|q); k is again set to 100.
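Equation (3) can be sketched as follows. All input structures here (per-month term-frequency tables, per-month total TFs, document frequencies, a precomputed query language model, and a vocabulary size for Laplace smoothing) are assumed for illustration; the article does not prescribe a data layout:

```python
def wla_estimate(query_words, tf_by_month, total_tf_by_month, df,
                 p_w_given_q, vocab_size, months):
    """Approximate f_t(q) via Eq. (3): a product over query words of
    P(w|t) * P(w|q) / DF(w).
    `tf_by_month[t][w]` is the TF of w within month t,
    `total_tf_by_month[t]` is the TF of all words within month t,
    `df[w]` is the document frequency of w, and
    `p_w_given_q[w]` is the query language model probability."""
    series = []
    for t in months:
        score = 1.0
        for w in query_words:
            # P(w|t) with Laplace (add-one) smoothing over the vocabulary
            p_w_t = (tf_by_month[t].get(w, 0) + 1) / (total_tf_by_month[t] + vocab_size)
            score *= p_w_t * p_w_given_q[w] / df[w]
        series.append(score)
    return series
```

Smoothing keeps every month's value positive, so a month where no query word occurs still gets a small but nonzero score rather than collapsing the product to zero.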
Two instances, “Earthquake” and “Boston Marathon,”
are shown in Figure 4, in which the dash-dot line is the real
time series curve built from query logs. We can see that the
three curves fluctuate almost consistently over time.
Preprocessing With Time Series Analysis
Most original time series curves look different from the
typical instances in Figure 1. This is because a curve can be
decomposed into multiple components, together determining its shape (Brockwell & Davis, 2002; Kulkarni et al.,
2011). In this article, each point of F(q) is decomposed into
three components:
² http://www.google.com/trends/explore
FIG. 3. Query temporal pattern detection framework.
4 JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—•• 2015
DOI: 10.1002/asi