A New Approach to Processing Ranked Queries in Uncertain Databases

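"Ranked Query Processing in Uncertain Databases." In recent years, with the rise of new applications such as sensor data monitoring and mobile device tracking, the management of uncertain data has become increasingly important. Unlike "certain" data, the data in an uncertain database are not exact points but often lie within a region. This paper focuses on ranked queries over uncertain data. Because of their wide use in applications such as decision making, recommender systems, and data mining tasks, ranked queries have been studied in depth in the traditional database literature, and many proposals aim to answer them more efficiently. However, existing methods all assume that the data are precise (certain). Because uncertain data differ fundamentally from certain data, these methods apply only to ranked queries in certain databases and cannot be used directly in the uncertain setting.

To address this challenge, we propose novel solutions for the probabilistic ranked query (PRank) that speed up ranked query processing over uncertain data. PRank is a query over uncertain data that takes the uncertainty of the data into account and orders results by probability. In an uncertain database, the uncertainty of the data can cause the same query to return several possible result sets, each with its own probability of occurring. The goal of PRank is therefore not only to return results but also to rank them by that probability. Traditional optimization techniques for ranked queries concentrate on indexing and query planning for certain data, whereas uncertain data require handling uncertainty ranges and probability distributions. To this end, we propose a method called J-PRank, which extends PRank to join queries involving multiple tables. J-PRank improves query performance by merging the uncertainty of different data sources and considering their joint probability distribution.

In J-PRank, we design a new index structure, a probabilistic index for uncertain data, which stores and retrieves uncertain data items efficiently. The index allows possible results to be located and evaluated quickly, which reduces computation cost. We also develop a query plan optimization strategy that accounts for the effect of uncertainty when choosing join order and operators. Experimental results show that, compared with existing techniques, J-PRank significantly improves query speed and resource utilization when processing ranked queries in uncertain databases, while maintaining result accuracy and the quality of the probabilistic ranking. Through these contributions, we provide a more powerful and adaptable tool for uncertain data management and lay a foundation for future applications and development of uncertain databases.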

however, with attribute uncertainty (that is, the attribute value of each dimension is imprecise). Query processing over such uncertain data thus cannot directly use traditional methods that are designed for precise data. Instead, as studied by previous works, different query types have to be redesigned in order to answer queries on uncertain data with confidence. That is, we need to retrieve query answers over uncertain data with probability greater than or equal to a probabilistic threshold. To list a few, the existing works include the range query [9], [32], nearest neighbor query [8], [9], [20], skyline query [25], reverse skyline query [22], and similarity join [19]. In the conference version of this work, Lian and Chen [23] studied the ranked query in the uncertain database with a linear preference function. In this long version, we generalize our proposed approaches to answer the PRank query with arbitrary monotonically increasing preference functions. Furthermore, we also propose a J-PRank query over two uncertain databases, which is useful for applications like data integration.

3 PROBLEM DEFINITION

In this section, we formally define the problem of the probabilistic ranked query (PRank). In particular, assume that we have a static uncertain database $D$ in a $d$-dimensional space in which each uncertain object $O(O_1, \ldots, O_d)$ can reside anywhere within a (hyperspherical) uncertainty region $UR(O)$ [8], [32] centered at point $C_O$ with radius $r_O$. Let $pdf(U)$ be the probability density function (pdf) with respect to the location that object $U$ appears. We have $pdf(U) \in [0, 1]$ if $U \in UR(U)$, and $pdf(U) = 0$ otherwise. Following the convention [9], [8], [25], we assume that all the data objects are independent of each other in the database $D$. The problem of retrieving the PRank query results is defined as follows:

Definition 3.1 (Probabilistic k-ranked query, k-PRank). Assume that we have an uncertain database $D$, a user-specified monotonic preference function $f(\cdot)$, and an integer $k$. For $1 \le m \le k$, we define the m-ranking probability $\Pr_m(O)$ of object $O \in D$ as

$$
\Pr_m(O) = \int_{s_1}^{s_2} \Pr\{f(O) = s\} \cdot \sum_{\forall \{P_1, \ldots, P_{m-1}\} \subseteq D \setminus \{O\}} \left( \prod_{i=1}^{m-1} \Pr\{f(P_i) \ge s\} \cdot \prod_{\forall P_j \in D \setminus \{O, P_1, \ldots, P_{m-1}\}} \Pr\{f(P_j) \le s\} \right) ds, \tag{1}
$$

where $s_1$ and $s_2$ are the lower and upper bounds of score $f(O)$ for object $O$, respectively. A k-PRank query retrieves $k$ uncertain objects $OR_1, OR_2, \ldots, OR_k$ ($\in D$) such that object $OR_m$ has the highest m-ranking probability $\Pr_m(OR_m)$ among all data objects in $D$.

Intuitively, (1) defines the expected probability $\Pr_m(O)$ (that is, the m-ranking probability) that object $O$ has the $m$th largest score in the database $D$. In particular, when the score $f(O)$ of object $O$ is $s \in [s_1, s_2]$, we consider all possible cases where there are exactly $(m - 1)$ objects $P_1, \ldots,$ and $P_{m-1}$ in $D \setminus \{O\}$ having higher scores than $s$ (that is, higher ranks than $O$), while the other objects $P_j \in D \setminus \{O, P_1, \ldots, P_{m-1}\}$ have lower scores than object $O$. Thus, as shown in (1), for each possible combination of $P_1, \ldots,$ and $P_{m-1}$, we calculate the probability that $O$ has the $m$th highest score by multiplying the probabilities that objects have either higher or lower scores than $s$ (due to the object independence [9], [8], [25]). Finally, we integrate the probability summation for all these combinations on $s$, and obtain the expected probability that $O$ has the $m$th rank.
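
The structure of (1) can be made concrete with a small numerical sketch. It is not the authors' algorithm: it assumes, purely for illustration, that the score of every object is uniformly distributed over a known interval, and it approximates the integral with a midpoint rule. The function names (`prob_score_at_least`, `exactly_j_higher`, `m_ranking_probability`) are hypothetical. The sum over all combinations of (m - 1) higher-scoring objects is evaluated with a dynamic program, which yields the same value as enumerating the combinations explicitly.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (lowest, highest) possible score, with lowest < highest


def prob_score_at_least(interval: Interval, s: float) -> float:
    """Pr{f(P) >= s} when the score is uniform on `interval` (an assumption)."""
    lo, hi = interval
    if s <= lo:
        return 1.0
    if s >= hi:
        return 0.0
    return (hi - s) / (hi - lo)


def exactly_j_higher(probs: List[float], j: int) -> float:
    """Probability that exactly j of the independent events occur.

    This equals the sum over all size-j subsets in Eq. (1), computed by
    dynamic programming instead of explicit enumeration.
    """
    dp = [1.0] + [0.0] * j  # dp[c] = Pr{exactly c events among those seen so far}
    for q in probs:
        for c in range(j, 0, -1):
            dp[c] = dp[c] * (1.0 - q) + dp[c - 1] * q
        dp[0] *= 1.0 - q
    return dp[j]


def m_ranking_probability(scores: List[Interval], target: int, m: int,
                          steps: int = 2000) -> float:
    """Midpoint-rule approximation of Eq. (1) for the object at index `target`."""
    s1, s2 = scores[target]
    others = [iv for i, iv in enumerate(scores) if i != target]
    width = (s2 - s1) / steps
    density = 1.0 / (s2 - s1)  # uniform pdf of the target's score on [s1, s2]
    total = 0.0
    for t in range(steps):
        s = s1 + (t + 0.5) * width
        higher = [prob_score_at_least(iv, s) for iv in others]
        total += density * exactly_j_higher(higher, m - 1) * width
    return total


if __name__ == "__main__":
    # Two objects with identical score intervals: each is ranked first
    # with probability of about 0.5, by symmetry.
    print(m_ranking_probability([(0.0, 1.0), (0.0, 1.0)], target=0, m=1))
```

For a fixed object, the values returned for m = 1, ..., |D| should sum to approximately 1, which is a convenient sanity check on any implementation of (1).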

After defining the m-ranking probability, the problem of the PRank query is to retrieve object $OR_1$ that has the highest score with the highest probability $\Pr_1(OR_1)$ among all the objects in $D$; object $OR_2$ that has the second highest score with the highest probability $\Pr_2(OR_2)$; ...; and object $OR_k$ that has the $k$th highest score with the largest probability $\Pr_k(OR_k)$. Intuitively, we consider PRank with the semantics of retrieving the most probable uncertain object for each rank $m$ from 1 to $k$. Note that there might be some other interesting semantics. For example, one could retrieve the uncertain objects with the highest probabilities of having ranks within $[1, k]$, where the probability is defined as the summation of our m-ranking probabilities over $m \in [1, k]$. That is, obtain $k$ objects that are most likely to be in the top-$k$ list (note that it is possible, however, that some top-$k$ objects do not appear in the result at the same time in practice). Nevertheless, in this work, we will only focus on PRank and leave other semantics as our future work.
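
The difference between the two semantics can be illustrated with a minimal sketch. It assumes the m-ranking probabilities have already been computed and are supplied as a table `rank_probs`, where `rank_probs[m - 1][i]` holds the m-ranking probability of object i. The table layout and both function names are hypothetical and serve only this illustration.

```python
from typing import List


def prank_select(rank_probs: List[List[float]]) -> List[int]:
    """PRank semantics: for each rank m, pick the object with the largest
    m-ranking probability. Returns one object index per rank."""
    return [max(range(len(row)), key=row.__getitem__) for row in rank_probs]


def top_k_membership(rank_probs: List[List[float]], k: int) -> List[int]:
    """Alternative semantics: sum the m-ranking probabilities over m in [1, k]
    and return the k objects most likely to appear anywhere in the top-k list."""
    n = len(rank_probs[0])
    totals = [sum(row[i] for row in rank_probs[:k]) for i in range(n)]
    return sorted(range(n), key=lambda i: -totals[i])[:k]


if __name__ == "__main__":
    # Illustrative probabilities for 3 objects and ranks m = 1, 2.
    probs = [
        [0.50, 0.45, 0.05],  # probability of being ranked 1st
        [0.10, 0.50, 0.40],  # probability of being ranked 2nd
    ]
    print(prank_select(probs))         # [0, 1]
    print(top_k_membership(probs, 2))  # [1, 0]
```

In this made-up table, object 0 is the most probable holder of rank 1, yet object 1 is the most likely member of the top-2 list overall, so the two semantics order the same objects differently.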

Next, we formalize a novel query type, namely, the probabilistic ranked query on join (J-PRank), on two uncertain databases.

Definition 3.2 (Probabilistic ranked query on join, J-PRank). Assume that we have two uncertain databases $A$ and $B$, a user-specified monotonic preference function $f(\cdot)$, an integer $k$, and a join predicate $IP$. A J-PRank query retrieves $k$ PRank objects on the join of two uncertain databases, that is,

$$
A \bowtie_{IP} B = \{(X, Y) \mid X \bowtie_{IP} Y,\ X \in A,\ Y \in B\}, \tag{2}
$$

where the definition of PRank objects refers to Definition 3.1.

From Definition 3.2, we can see that the J-PRank query retrieves PRank results on the join of two uncertain databases rather than one single database. Thus, the processing of J-PRank queries is more complex and challenging than that of PRank. In particular, we have to consider the join predicate between object pairs from two databases. For simplicity, in this paper, we consider the join predicate $IP$ as a similarity predicate on uncertain data [19], that is, $\Pr\{dist(X, Y) \le \varepsilon\} \ge \alpha$, where $dist(\cdot, \cdot)$ is a Euclidean distance function, and $\varepsilon$ and $\alpha$ are the distance and probability thresholds specified by the join predicate. Nevertheless, our proposed approaches in this work are not sensitive to a specific predicate, and thus, can be easily extended to other join predicates as well. We would like to leave it as our future work.
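
As a rough illustration of the similarity join predicate, the sketch below estimates the probability that two uncertain objects lie within a given distance of each other and compares the estimate with a probability threshold. Several things here are assumptions rather than the paper's method: each object is taken to be uniformly distributed inside its hyperspherical uncertainty region, the probability is estimated by Monte Carlo sampling, and the names `sample_in_ball`, `join_probability`, `satisfies_join`, `eps`, and `alpha` are hypothetical.

```python
import math
import random
from typing import List, Sequence


def sample_in_ball(center: Sequence[float], radius: float,
                   rng: random.Random) -> List[float]:
    """Draw one point uniformly from the ball with the given center and radius."""
    d = len(center)
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in direction)) or 1.0  # guard against zero
    # Scaling the radius by U^(1/d) makes the sample uniform in volume.
    scale = radius * rng.random() ** (1.0 / d) / norm
    return [c + scale * x for c, x in zip(center, direction)]


def join_probability(center_x: Sequence[float], radius_x: float,
                     center_y: Sequence[float], radius_y: float,
                     eps: float, samples: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of Pr{dist(X, Y) <= eps} under the uniform assumption."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x = sample_in_ball(center_x, radius_x, rng)
        y = sample_in_ball(center_y, radius_y, rng)
        if math.dist(x, y) <= eps:
            hits += 1
    return hits / samples


def satisfies_join(center_x: Sequence[float], radius_x: float,
                   center_y: Sequence[float], radius_y: float,
                   eps: float, alpha: float,
                   samples: int = 20000, seed: int = 0) -> bool:
    """True if the estimated probability reaches the threshold alpha."""
    p = join_probability(center_x, radius_x, center_y, radius_y,
                         eps, samples, seed)
    return p >= alpha


if __name__ == "__main__":
    # Two unit-radius objects in 2D whose centers are 1.5 apart.
    print(join_probability([0.0, 0.0], 1.0, [1.5, 0.0], 1.0, eps=1.5))
```

Because the estimate carries sampling error, pairs whose estimated probability falls close to the threshold would need more samples, or an exact computation, before being accepted or rejected.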

Since previous approaches are designed only for the ranked query processing over precise objects, they are not suitable for handling uncertain data. Thus, the only straightforward method to answer PRank queries is probably the linear scan. That is, we sequentially scan all the

LIAN AND CHEN: RANKED QUERY PROCESSING IN UNCERTAIN DATABASES 423
