概率反向最近邻查询：不确定数据的高效处理

151 浏览量更新于2024-07-15 收藏 682KB PDF 举报

"这篇研究论文探讨了在处理不确定数据时概率反向最近邻查询的高效方法。面对现实世界中由于测量设备限制、环境干扰或应用特性（如移动对象监测）导致的数据不确定性，传统的精确数据库查询方法无法直接应用于不确定场景。论文提出了概率反向最近邻（PRNN）查询的概念，这是一种在用户指定阈值以上的概率下获取可能是反向最近邻的数据对象的方法。" 在实际应用中，反向最近邻（RNN）查询是一个至关重要的问题。它在数据库中寻找所有将查询对象作为其最近邻的数据对象。然而，当数据存在不确定性，如不精确测量或环境因素导致的数据模糊时，以往用于精确数据库的RNN查询解决方案就显得不足。这篇论文首次将RNN查询的概念扩展到不确定数据库领域，提出了概率反向最近邻（PRNN）查询。 PRNN查询的目标是找到那些具有高于或等于用户设定概率阈值成为查询对象反向最近邻的数据对象。这一概念的提出旨在解决不确定性带来的挑战，确保查询结果不仅考虑距离，还考虑了数据的可靠性或可能性。论文可能涵盖了以下几个方面： 1. **不确定性的建模**：讨论如何表示和处理不确定数据，可能包括概率分布模型、区间或云模型等。 2. **查询处理算法**：设计适应不确定数据的PRNN查询算法，可能涉及概率计算、随机模拟或其他优化技术。 3. **性能评估**：通过实验验证算法的效率和准确性，比较不同策略下的查询性能。 4. **阈值影响分析**：探究用户设定的不同概率阈值对查询结果的影响。 5. **应用案例**：可能包含实际应用示例，展示PRNN查询在现实世界问题中的应用价值，如目标跟踪、地理信息系统或社交网络分析。此研究论文对于理解和开发处理不确定数据的高效查询方法具有重要意义，为大数据分析和决策支持提供了新的工具和思路。通过深入理解PRNN查询，我们可以更好地应对现实世界中由于数据不确定性带来的挑战，提高数据挖掘和分析的准确性和可靠性。

Symbols Descriptions

D (A, or B) the uncertain database(s)

d the dimensionality of the data set

UR( o) the uncertainty region of object o

¯(o) the hypersphere that tightly bounds UR(o)

α the user-speciﬁed probability threshold

dist(·, ·) the Euclidean distance between two objects

Table 1 Notations and Descriptions

(NN) of each data object, and then check whether or

not this object is closer to query point than to its NN.

However this requires a sequential scan over the entire

data set and incurs high computation cost (I/O cost

as well). In order to reduce such cost, many pruning

techniques [18,36,41,33,38] have been proposed, which

assume that data objects are precise points.

Similarly, for our PRNN problem, it is also ineﬃ-

cient to apply the sequential scan method. Moreover,

the inherent uncertainty of data objects makes the ex-

isting RNN techniques inapplicable to the PRNN re-

trieval. For example, the pruning region deﬁned in the

state-of-the-art TPL method [38] is invalid in the PRNN

case, since the positions of objects are uncertain. Fur-

thermore, in order to reﬁne PRNN candidates, the cal-

culation of the complex Inequality (1) requires access-

ing all the objects in the database, which is very costly.

Therefore, in the sequel, we ﬁrst design an eﬀective

pruning method that is speciﬁc for our PRNN query

processing, and then reduce the number of objects that

are involved in the procedure of reﬁning candidates. Ta-

ble 1 summarizes the commonly-used symbols in this

paper.

4 Probabilistic RNN Search

As mentioned above, it is not eﬃcient to answer PRNN

queries via sequential scan, in terms of both compu-

tation and I/O costs. Thus, we use a multidimensional

index I (e.g. R-tree [12], M-tree [10], or SR-tree [14]) to

facilitate PRNN query processing over uncertain database

D. Since our proposed method is independent of the

underlying index, in this paper, we use one of popular

indexes, R-tree, to index uncertain objects (i.e. their

uncertainty regions), which has been extensively used

in the literature of RNN on precise data in Euclidean

space [18,33,36,38,39] or uncertain query processing

[20,21,23,24,28]. The framework for our PRNN search

is as follows. Given any query object q, we ﬁrst tra-

verse the R-tree I with an eﬀective pruning method to

reduce the PRNN search space, obtaining a candidate

set, and then reﬁne the candidates by checking their ac-

tual probabilities P

P RN N

(q, o) in Inequality (1). Those

objects with probabilities greater than or equal to α are

returned as the query result.

4.1 Heuristics of GP Method

In this subsection, we show the heuristics of our pruning

method, namely geometric pruning (GP), that permits

the development of eﬃcient PRNN search procedures.

Recall that, a PRNN query retrieves all the data ob-

jects that have query object q as their NNs with prob-

abilities greater than or equal to α ∈ (0, 1]. Thus, our

basic pruning idea is to discard those objects that have

probabilities never greater than β, for some β ∈ [0, α).

For brevity, we denote such a pruning method as GP

which, based on the PRNN deﬁnition, would not intro-

duce any false dismissals (actual answers to the query

that are however not in the retrieval result).

Note that, a special case of GP

, GP

when β = 0,

will ﬁlter out data objects that are deﬁnitely not RNNs

(i.e. with zero probability to be RNNs) of a query point

q. In fact, this case corresponds to applications where

we do not know the distributions of data objects within

their uncertainty regions. In order to guarantee no-

false-dismissal property of the query result, our GP

method has to conservatively prune data objects based

on the geometric property (e.g. shape) of their uncer-

tainty regions, assuming objects can locate anywhere in

regions and with any distributions. On the other hand,

in applications where object distributions are known in

advance, our GP

method (β ∈ [0, α)) can wisely uti-

lize this additional information to enhance the pruning

ability. Since GP

(β > 0) essentially applies techniques

similar to GP

, in the sequel, we will start with GP

and later generalize GP

to GP

(β ∈ [0, α)) in Section

4.5.

(a)

(b)

Fig. 2 2D Example of GP

Heuristics

Consider a query point q and any uncertain object o

in a 2D space as shown in Fig. 2(a), where o can locate

anywhere in its uncertainty region UR(o) with any dis-

tribution. Let o

be one possible position of uncertain

object o. The perpendicular bisector ⊥ (q, o

) between

points q and o

would partition the entire data space

into two halfplanes, HP

(q, o

) that contains point o

and HP

(q, o

) that contains point q. For any data ob-

ject p that completely falls into HP

(q, o

) (i.e. the

shaded area), we have dist(o

, p) < dist(q, p) (i.e. q is

not NN of p). Therefore, p cannot be the RNN of q [38].

剩余19页未读，继续阅读

weixin_38538585

粉丝: 3
资源: 956

概率反向最近邻查询：不确定数据的高效处理

论文研究-基于不确定数据的top-k概率相互最近邻查询.pdf

具有马尔可夫相关性的不确定数据的区间反向最近邻查询

大k不确定数据概率RkNN查询的高效算法

在使用R树索引的空间数据库中，如何设计一个高效的反向最近邻查询算法？请提供算法细节和性能分析。

如何在R树索引的空间数据库中实现高效的最近邻查询？请结合R树的特性详细说明查询过程。

用MappedByteBuffer处理FileChannel数据 反向解析

关联查询中正向查询与反向查询有什么区别

数据库的反向模糊查询

消息队列常用于 事务处理 流量削峰 反向代理 数据持久化

swiper反向加载数据

最新资源

用MappedByteBuffer处理FileChannel数据反向解析

消息队列常用于事务处理流量削峰反向代理数据持久化