Optimization Strategies for Approximate Continuous Top-k Queries over Sliding Windows


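The approximate continuous top-k query over a sliding window is an important problem in the database field: as the window slides, the query retrieves the k objects with the highest scores. Traditional approaches rely mainly on exact algorithms, whose core idea is to maintain a subset of the objects in the window and to answer the query from that subset. Existing solutions, however, have several notable drawbacks.

First, they are very sensitive to query parameters and data distribution. Their performance depends heavily on the choice of query parameters (such as window size and slide length) and on data characteristics (such as uniformity and density of the distribution). With skewed data or poorly chosen parameters, query efficiency can suffer.

Second, existing exact algorithms are often expensive, especially for large datasets and frequent queries. As the data volume grows, updating the maintained subset of window objects can become complex and consume substantial computing resources. When the window moves frequently, real-time and response-time requirements can also become a bottleneck.

To overcome these problems, Rui Zhu, Bin Wang, Shi-Ying Luo, et al. studied the design of approximate algorithms in their 2017 paper "Approximate Continuous Top-k Query over Sliding Window", published in the Journal of Computer Science and Technology. They propose a new method that aims to reduce the dependence on query parameters and to lower computational complexity and storage requirements by introducing approximation.

The paper may propose an approximate algorithm based on heuristics or data-structure optimization, for example using hashing, sorting networks, or priority queues, trading some accuracy for query speed. This may include using a threshold to decide which objects are most likely to enter the top-k list, or using a heuristic to predict which objects in the window are likely to score highly in the future so that they can be preprocessed in advance. The authors may also discuss how to evaluate the algorithm, for example with F1 scores or precision-recall curves, to measure its time and space efficiency while a given degree of approximation is maintained. They may also compare their method with existing exact algorithms in different scenarios to show its advantages.

In summary, the paper focuses on improving the efficiency and robustness of approximate continuous top-k queries in the sliding-window setting. It introduces an approximation strategy to reduce the dependence of exact algorithms on query parameters and data distribution, which makes it applicable to a wider range of data-processing scenarios, and it offers a new research direction for large-scale database queries with strict real-time requirements.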

4 J. Comput. Sci. & Technol., Jan. 2017, Vol.32, No.1

{98, 97}, $98 - 98 \le 2$, and $97 - 96 \le 2$, we call these results acceptable approximate results. When the window slides to $W_1$, i.e., the objects are updated to {88, 93, 85, 77, 60, 82, 73, 70, 48, 60, 71, 66}, our algorithm outputs {93, 88}. They are also acceptable approximate results. When the window slides to $W_2$, the objects are updated to {77, 60, 82, 73, 70, 48, 60, 71, 66, 65.5, 54, 70}, and our algorithm outputs {73, 71}. Since the exact results are {82, 77}, in which $82 - 73 > 2$, we regard 73 as an unacceptable approximate result.

Assuming the window slides from $W_0$ to $W_{n-1}$, our proposed TAHM algorithm outputs $2 \times n$ results during this process. For each approximate result $o_j$, given $\Pr(|F(r_j) - F(o_j)| \le 2) > 0.99$, the expected number of unacceptable approximate results among these results is $(1 - \delta) \times n \times k = 0.02n$. Here, $F(r_j)$ is the j-th highest score in the approximate result set of $W_i$ and $F(o_j)$ is the j-th highest score in $W_i$.
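The acceptability test in this example can be checked mechanically. The following is a minimal sketch, not the paper's code: the function names `unacceptable_ranks` and `expected_unacceptable` and the parameter names `eps` and `failure_prob` are assumptions introduced for illustration. It compares the j-th highest exact score with the j-th highest approximate score and reports the ranks whose difference exceeds the tolerance.

```python
import heapq

def unacceptable_ranks(window, approx, k, eps):
    """Return the 1-based ranks j whose exact and approximate scores differ by more than eps."""
    exact = heapq.nlargest(k, window)              # exact top-k scores, descending
    approx_sorted = sorted(approx, reverse=True)   # approximate results, descending
    return [j + 1
            for j, (o, r) in enumerate(zip(exact, approx_sorted))
            if abs(o - r) > eps]

def expected_unacceptable(n, k, failure_prob):
    """Expected number of unacceptable results over n windows with k results each."""
    return failure_prob * n * k

# The two windows listed in the example above (hypothetical variable names).
w1 = [88, 93, 85, 77, 60, 82, 73, 70, 48, 60, 71, 66]
w2 = [77, 60, 82, 73, 70, 48, 60, 71, 66, 65.5, 54, 70]

print(unacceptable_ranks(w1, [93, 88], k=2, eps=2))   # []
print(unacceptable_ranks(w2, [73, 71], k=2, eps=2))   # [1, 2]
```

For the second window both ranks exceed the tolerance (82 − 73 = 9 and 77 − 71 = 6), which agrees with the text's verdict on 73. With a per-result failure probability of 0.01 and k = 2, `expected_unacceptable(n, 2, 0.01)` evaluates the 0.02n figure quoted above.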

3 Framework Overview

As has been discussed, the main challenges of answering approximate top-k queries are: 1) for each newly arrived object o, the algorithm should efficiently prune it if o cannot become a query result; 2) if o is not pruned, it should be inserted into the candidate set with low computation cost. In this paper, we tackle these challenges by designing an efficient framework, named PABF (Probabilistic Approximation Based Framework), for supporting approximate continuous top-k queries over sliding windows.

As shown in Algorithm 1, PABF mainly consists of the following four modules: Filter, Local-Merge, Global-Merge, and TA-Heap. Filter is used for pruning newly arrived objects (lines 3∼7). To be more specific, for each object $o \in s$, we determine whether it is a query result (or may become a result for a future query window with a probability of at least $1 - \delta$). If so, o is selected as a candidate and inserted into a temporary buffer M; otherwise, o is ignored. Compared with other one-pass algorithms, one advantage of our algorithm is that it can directly prune $O(s - k)$ objects in $s$ with the help of a suitable pruning value.
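To make the role of a pruning value concrete, here is a deliberately simplified sketch. It is not the paper's probabilistic Filter: it assumes a deterministic rule in which the pruning value defaults to the k-th highest score of the incoming batch, and the names `filter_batch` and `pruning_value` are hypothetical. It only illustrates how a single threshold can discard roughly s − k of the s newly arrived objects before they reach the buffer M.

```python
import heapq

def filter_batch(batch, k, pruning_value=None):
    """Split a batch of scores into (candidates M, number of pruned objects)."""
    if pruning_value is None:
        # Assumption: use the k-th highest score in the batch as the pruning value.
        top = heapq.nlargest(k, batch)
        pruning_value = top[-1] if top else float("-inf")
    # Objects scoring below the pruning value are ignored; the rest go to M.
    M = [score for score in batch if score >= pruning_value]
    return M, len(batch) - len(M)

M, pruned = filter_batch([65.5, 54, 70], k=2)
print(M, pruned)   # [65.5, 70] 1
```

A caller that already knows a tighter bound, for example from the current result set, could pass it as `pruning_value` and prune the whole batch in one scan.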

After scanning $s$, we merge candidates in M with candidate set B. To speed up candidate maintenance, we partition B into a group of buckets. Formally, given the candidate set B, a partition $P(B, m) = \{b_1, b_2, \ldots, b_m\}$ partitions the elements in B into m buckets $\{b_1, b_2, \ldots, b_m\}$ such that 1) $0 < |b_i| < \varphi^{m-i+1} k$; 2) $\forall o \in b_i, o' \in b_{i-1}$, $T(o)$ is larger than $T(o')$; 3) objects in each bucket are sorted. Here, $T(o)$ refers to the arrival order of o, and $\varphi$ is a coefficient whose optimal value will be studied in Section 5.
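The three partition conditions can be expressed as a small validity check. This is a sketch under stated assumptions: buckets are indexed 1 to m with the most recent arrivals in the last bucket, each object is a hypothetical `(score, arrival_order)` pair, and "sorted" is taken to mean descending by score. The function names are illustrative and do not come from the paper.

```python
def bucket_capacity(i, m, k, phi):
    """Upper bound phi^(m - i + 1) * k on the size of bucket b_i, for 1 <= i <= m."""
    return phi ** (m - i + 1) * k

def is_valid_partition(buckets, k, phi):
    """Check conditions 1) to 3); buckets[i - 1] holds b_i as (score, arrival_order) pairs."""
    m = len(buckets)
    for i, b in enumerate(buckets, start=1):
        # Condition 1: non-empty and strictly below the capacity of bucket i.
        if not 0 < len(b) < bucket_capacity(i, m, k, phi):
            return False
        # Condition 3 (assumed ordering): scores are in descending order.
        scores = [score for score, _ in b]
        if scores != sorted(scores, reverse=True):
            return False
        # Condition 2: every object in b_i arrived after every object in b_(i-1).
        if i > 1:
            oldest_here = min(t for _, t in b)
            newest_prev = max(t for _, t in buckets[i - 2])
            if oldest_here <= newest_prev:
                return False
    return True

b1 = [(93, 2), (88, 1), (85, 3)]
b2 = [(82, 6), (73, 7)]
print(is_valid_partition([b1, b2], k=2, phi=2))   # True
```

With k = 2 and φ = 2, the newest bucket may hold fewer than 4 objects and the older one fewer than 8, so capacities grow geometrically toward older buckets.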

Algorithm 1. PABF Framework

Input: query window W, current result set R, and candidate set B
Output: updated candidate set B and updated result set R

 1  Window maintenance: W ← W − s (expired objects), W ← W ∪ s (newly arrived objects);
 2  for i from 0 to s − 1 do
 3      Bool bPruned ← Filter(s[i]);
 4      if bPruned = false then
 5          M ← M ∪ s[i];
 6      else
 7          Ignore s[i];
 8  for i from 0 to |M| − 1 do
 9      Local-Merge(M[i], $b_m$);
10  if $|b_m| > \varphi k$ then
11      Int i ← m − 1, Bool bMerge ← true;
12      while bMerge do
13          Global-Merge($b_i$, $b_{i+1}$);
14          if $|b_i| < \varphi^{m-i+1} k$ then
15              bMerge ← false;
16  R ← TA-Heap(B, M);
17  Return;

Based on the partition result, the merge operation can be divided into two phases: Local-Merge and Global-Merge. In the Local-Merge phase, we sort candidates according to their scores via merge sort. We then try to combine candidates in M with the ones in $b_m$ if their scores are roughly the same. Besides, we also maintain the dominate number for each element in $b_m$ and delete meaningless ones. Through the above operations, the size of $b_m$ can be effectively reduced. Since the merge cost mainly depends on the size of $b_m$, a smaller $b_m$ helps us further reduce the cost of merging. After Local-Merge, if $|b_m| > \varphi k$, our algorithm enters the Global-Merge phase, which further merges it with $b_{m-1}$ (lines 10∼15). The following operations are repeated until the condition $|b_i| < \varphi^{m-i+1} k$ is satisfied. Note that in this phase, we could further reduce the cost of candidate maintenance by computing the optimal partition.
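The two-phase merge can be sketched as an overflow cascade. The code below is an illustration under simplifying assumptions, not the paper's implementation: buckets hold plain scores in descending order, the combination of near-equal scores and the dominate-number deletion are omitted, an emptied bucket is simply left in place, and the names `local_merge` and `merge_cascade` are hypothetical.

```python
import heapq

def local_merge(M, b_m):
    """Merge the new candidates M into the newest bucket b_m, keeping descending order."""
    return list(heapq.merge(sorted(M, reverse=True), b_m, reverse=True))

def merge_cascade(buckets, M, k, phi):
    """buckets[i - 1] holds b_i (descending scores); returns the updated bucket list."""
    buckets = [list(b) for b in buckets]        # work on a copy
    m = len(buckets)
    buckets[-1] = local_merge(M, buckets[-1])   # Local-Merge phase
    if len(buckets[-1]) > phi * k:              # b_m overflows: Global-Merge phase
        i = m - 1
        while i >= 1:
            # Fold b_(i+1) into its older neighbour b_i.
            merged = list(heapq.merge(buckets[i - 1], buckets[i], reverse=True))
            buckets[i - 1], buckets[i] = merged, []
            if len(merged) < phi ** (m - i + 1) * k:
                break                           # b_i is under its capacity: stop
            i -= 1
    return buckets

print(merge_cascade([[93, 88], [85, 82, 77]], [73, 71], k=2, phi=2))
# [[93, 88, 85, 82, 77, 73, 71], []]
```

In this run the newest bucket grows to 5 objects, which exceeds φk = 4, so it is folded into the older bucket, whose 7 objects stay below its capacity of 8 and end the cascade. Because older buckets have geometrically larger capacities, most batches touch only the newest bucket.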

Now we demonstrate how PABF supports the query. We maintain $k'$ ($k < k' < 2k$) objects with the highest scores in set R. When a query result leaves the window, we retrieve the new results from R if $k < k'$. Otherwise, we directly retrieve new answers from B,
