frequency of query q within the t-th time interval (the search volume within interval t divided by the total search volume over all time), t = 1, …, N, where "month" is the interval used here and N is the number of time intervals (months). The time series F(q) of a query q over N time intervals is the sequence of f_t(q), denoted as

$F(q) = \{f_1(q), f_2(q), \ldots, f_t(q), \ldots, f_N(q)\}$  (1)
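As a concrete sketch, f_t(q) can be computed from a query log as the monthly search volume of q divided by its all-time volume. The log format below, a list of (query, month) pairs, is an assumption for illustration only and is not specified in the article:

```python
from collections import Counter

def build_time_series(log_records, query, months):
    """Build F(q): for each interval t, the search volume of `query`
    within t divided by its total search volume over all time.
    `log_records` is an assumed format: a list of (query, month) pairs;
    `months` is the ordered list of N time intervals."""
    per_month = Counter(m for q, m in log_records if q == query)
    total = sum(per_month.values())
    if total == 0:
        return [0.0] * len(months)  # no search records for this query
    return [per_month[m] / total for m in months]
```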
For most queries, we can determine F(q) by mining the
query logs. However, there are some queries, especially long
queries, for which we cannot determine F(q) due to the
insufficient search volume of the query logs. For example, if
we submit “a Canadian publishing house book” to Google
Trends,² it returns "Not enough search volume to show graphs." Thus, for these queries, we investigate how to estimate F(q) without the use of query logs.
We present two estimation approaches based on the fol-
lowing phenomenon. There are basically two roles on the
Internet: information publishers and information hunters.
Intuitively, they act consistently over time. In other words,
when an event occurs, information publishers publish the
related news articles, and information hunters simultane-
ously submit queries to search them. The number of relevant
articles reflects the popularity of corresponding queries, and
vice versa (Adar, Teevan, Dumais, & Elsas, 2009; Dakka,
Gravano, & Ipeirotis, 2012). We expect to construct the time
series F(q) from the document corpus when a query's search
records in query logs are not sufficient to build the curves.
Two approaches, document-level approximation (DLA) and word-level approximation (WLA), are proposed to solve this problem at different levels.
DLA
We assume that changes in a query's search frequency are reflected by the number of relevant documents published over time. Therefore, for a given query q, we use its relevant documents over time to approximate its time series curve, as follows:
$f_t(q) \approx \frac{\sum_{d \in D_k} P(t \mid d)\, P(d \mid q)}{\sum_{d \in D_k} P(d \mid q)}$  (2)
where t represents the time interval within which a docu-
ment is published. P(t|d) equals 1 if the publication time
(Chen, Ma, Cui, Rui, & Huang, 2010) of document d is
within t or 0 otherwise. In this article, D_k denotes the top
k = 100 Google search results of query q. The publication
time of a document is obtained by setting time constraints in
“Google Search Tools” when crawling Google search
results. P(d|q) is the relevance of document d to query q,
which is estimated with the Query Likelihood Model (Ponte
& Croft, 1998). If more relevant documents for query q are
published within the time interval t, Equation (2) will have a
larger value at t.
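Equation (2) can be sketched as a relevance-weighted histogram over publication months. In the snippet below, the (publication_month, relevance) pairs stand in for P(t|d) and the Query Likelihood estimate of P(d|q); this data layout is an assumption for illustration:

```python
def dla_estimate(docs, months):
    """Approximate f_t(q) via Eq. (2).
    `docs` is an assumed format: a list of (publication_month, relevance)
    pairs for the top-k search results of q, where relevance plays the
    role of P(d|q). P(t|d) is 1 when d was published in month t, else 0."""
    denom = sum(rel for _, rel in docs)  # sum of P(d|q) over D_k
    series = []
    for t in months:
        # numerator: sum of P(t|d) * P(d|q), i.e. relevance mass in month t
        num = sum(rel for m, rel in docs if m == t)
        series.append(num / denom if denom else 0.0)
    return series
```

Months in which more relevant documents were published receive proportionally larger values, mirroring the behavior described for Equation (2).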
WLA
We also assume that a query’s search frequencies over
time can be reflected by the occurrence of its words. Thus,
for a given query q, we use its word-occurrence distribution
over time to approximate its time series curve:
$f_t(q) \approx \prod_{w \in q} \frac{P(w \mid t)\, P(w \mid q)}{DF(w)}$  (3)
P(w|t) is the probability of generating the query word w within the time interval t and is estimated as the term frequency (TF) of w within t divided by the TF of all words within t in the document corpus, with Laplace smoothing. DF(w) is the document frequency (DF) of word w. From the perspective of language models for information retrieval, P(w|q) is the query language model. However, most user queries are short, so P(w|q) is usually estimated with some interpolation approaches (Manning, Raghavan, & Schütze, 2008). In this article, we use the TF of word w divided by the TF of all words of query q in the top k Google search results to estimate P(w|q); k is again set to 100.
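Equation (3) can be sketched as follows. All input structures here (per-month term-frequency tables, per-month total TFs, document frequencies, a precomputed query language model, and a vocabulary size for Laplace smoothing) are assumed for illustration; the article does not prescribe a data layout:

```python
def wla_estimate(query_words, tf_by_month, total_tf_by_month, df,
                 p_w_given_q, vocab_size, months):
    """Approximate f_t(q) via Eq. (3): a product over query words of
    P(w|t) * P(w|q) / DF(w).
    `tf_by_month[t][w]` is the TF of w within month t,
    `total_tf_by_month[t]` is the TF of all words within month t,
    `df[w]` is the document frequency of w, and
    `p_w_given_q[w]` is the query language model probability."""
    series = []
    for t in months:
        score = 1.0
        for w in query_words:
            # P(w|t) with Laplace (add-one) smoothing over the vocabulary
            p_w_t = (tf_by_month[t].get(w, 0) + 1) / (total_tf_by_month[t] + vocab_size)
            score *= p_w_t * p_w_given_q[w] / df[w]
        series.append(score)
    return series
```

Smoothing keeps every month's value positive, so a month where no query word occurs still gets a small but nonzero score rather than collapsing the product to zero.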
Two instances, “Earthquake” and “Boston Marathon,”
are shown in Figure 4, in which the dash-dot line is the real
time series curve built from query logs. We can see that the
three curves fluctuate almost consistently over time.
Preprocessing With Time Series Analysis
Most original time series curves look different from the
typical instances in Figure 1. This is because a curve can be
decomposed into multiple components, together determining its shape (Brockwell & Davis, 2002; Kulkarni et al.,
2011). In this article, each point of F(q) is decomposed into
three components:
² http://www.google.com/trends/explore
FIG. 3. Query temporal pattern detection framework.
4 JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—•• 2015
DOI: 10.1002/asi