however, with attribute uncertainty (that is, the attribute
value of each dimension is imprecise). Query processing
over such uncertain data thus cannot directly use traditional
methods that are designed for precise data. Instead, as
studied by previous works, different query types have to be
redesigned in order to answer queries on uncertain data
with confidence. That is, we need to retrieve query answers
over uncertain data with probability greater than or equal to
a probabilistic threshold. To list a few, the existing works
include the range query [9], [32], nearest neighbor query [8], [9],
[20], skyline query [25], reverse skyline query [22], and
similarity join [19]. In the conference version of this work,
Lian and Chen [23] studied the ranked query in the
uncertain database with linear preference function. In this
long version, we generalize our proposed approaches to
answering the PRank query with arbitrary monotonically
increasing preference functions. Furthermore, we also
propose a J-PRank query over two uncertain databases,
useful for applications like data integration.
3PROBLEM DEFINITION
In this section, we formally define the problem of the
probabilistic ranked query (PRank). In particular, assume that
we have a static uncertain database D in a d-dimensional space
in which each uncertain object OðO
1
;O
2
; ...;O
d
Þ can reside
anywhere within an (hyperspherical) u ncertainty region
URðOÞ [8], [32] centered at point C
O
with radius r
O
. Let
pdfðUÞ be the probability density function (pdf) with respect to
the location that object U appears. We have pdfðUÞ2½0; 1,if
U 2 URðUÞ; pdfðUÞ¼0, otherwise. Following the conven-
tion [9], [8], [25], we assume that all the data objects are
independent of each other in the database D . The problem of
retrieving the PRank query results is defined as follows:
Definition 3.1 (Probabilistic k-ranked query, k-PRank).
Assume that we have an uncertain database D, a user-specified
monotonic preference function fðÞ, and an integer k. For
1 m k, we define the m-ranking probability Pr
m
ðOÞ of
object O 2Das
Pr
m
ðOÞ¼
Z
s
2
s
1
PrffðOÞ¼sg
X
8fP
1
;P
2
;...;P
m1
g2DnfOg
Y
m1
i¼1
PrffðP
i
Þsg
Y
8P
j
2DnfO;P
1
;...;P
m1
g
PrffðP
j
Þsg
!!
ds;
ð1Þ
where s
1
and s
2
are the lower and upper bounds of score fðOÞ
for object O, respectively. A k-PRank query retrieves
k uncertain objects OR
1
;OR
2
; ...;OR
m
; ...;OR
k
ð2 DÞ such
that object OR
m
has the highest m-ranking probabil ity
Pr
m
ðOR
m
Þ among all data objects in D.
Intuitively, (1) defines the expected probability Pr
m
ðOÞ
(that is, m-ranking probability)thatobjectO has the
mth largest score in the database D. In particular, when
the score fðOÞ of object O is s 2½s
1
;s
2
, we consider all
possible cases where there are exactly (m 1) objects
P
1
;P
2
; ..., and P
m1
in DnfOg having higher scores than s
(that is, higher ranks than O), while the other objects P
j
2
DnfO; P
1
; ...;P
m1
g have lower scores than object O. Thus,
as shown in (1), for each possible combination of P
1
;P
2
; ...,
and P
m1
, we calculate the probability that O has the mth
highest score by multiplying probabilities that objects have
either higher or lower scores than s (due to the object
independence [9], [8], [25]). Finally, we integrate the
probability summation for all these combinations on s,
and obtain the expected probability that O has the mth rank.
After defining the m-ranking probability, the problem of
the PRank query is to retrieve object OR
1
that has the
highest score with the highest probability Pr
1
ðOR
1
Þ among
all the objects in D; object OR
2
that has the second highest
score with the highest probability Pr
2
ðOR
2
Þ;...; and object
OR
k
that has t he kth highest score with the largest
probability Pr
k
ðOR
k
Þ. Intuitively, we consider PRank with
the semantics that retrieve the most probable uncertain
object for each rank m from 1 to k. Note that there might be
some other interesting semantics. For example, retrieve
uncertain objects with the highest probabilities of having
ranks within ½1;k, where the probability is defined as the
summation of our m-ranking probabilities on m 2½1;k.
That is, obtain k objects that are most likely to be in the top- k
list (note that it is possible, however, that some top-k objects
do not appear in the result at the same time in practice).
Nevertheless, in this work, we will only focus on PRank and
leave other semantics as our future work.
Next, we formalize a novel query type, namely, probabil-
istic ranked query on join (J-PRank), on two uncertain
databases.
Definition 3.2 (Probabilistic ranked query on join,
J-PRank). Assume that we have two uncertain databases A
and B, a user-specified monotonic preference function fðÞ,an
integer k, and a join predicate IP . A J-PRank query retrieves
k PRank objects on the join of two uncertain databases, that is,
Affl
IP
B¼fðX; Y ÞjX ffl
IP
Y;X2A;Y 2Bg; ð2Þ
where the definition of PRank objects refers to Definition 3.1.
From Definition 3.2, we can see that the J-PRank query
retrieves PRank results on the join of two uncertain
databases ra ther than one sing le database. Thus, th e
processing of J-PRank queries is more complex and
challenging than that of PRank. In particular, we have to
consider the join predicate between object pairs from two
databases. For simplicity, in this paper, we consider the join
predicate IP as a similarity predicate on uncertain data [19],
that is, PrfdistðX;Y Þ"g, where distð; Þ is a eucli-
dean distance function, and " and are the distance and
probability thresholds specified by the join predicate.
Nevertheless, our proposed approaches in this work are
not sensitive to a specific predicate, and thus, can be easily
extended to other join predicates as well. We would like to
leave it as our future work.
Since previous approaches are designed only for the
ranked query processing over precise objects, they are not
suitable for handling uncertain data. Thus, the only
straightforward method to answer PRank queries is prob-
ably the linear scan. That is, we sequentially scan all the
LIAN AND CHEN: RANKED QUERY PROCESSING IN UNCERTAIN DATABASES 423