
Top-k Query Processing in Uncertain Databases
Mohamed A. Soliman Ihab F. Ilyas
School of Computer Science
University of Waterloo
{m2ali,ilyas}@uwaterloo.ca
Kevin Chen-Chuan Chang
Department of Computer Science
University of Illinois at UrbanaChampaign
kcchang@cs.uiuc.edu
Abstract
Top-k processing in uncertain databases is semanti-
cally and computationally different from traditional top-k
processing. The interplay b etween score and uncertainty
makes traditional techniques inapplicable. We introduce
new probabilistic formulations for top-k queries. Our for-
mulations are based on “marriage” of traditional top-k se-
mantics and possible worlds semantics. In the light of these
formulations, we construct a framework that encapsulates a
state space model and efficient query processing techniques
to tackle the challenges of uncertain data settings. We prove
that our techniques are optimal in terms of the number of
accessed tuples and materialized search states. Our exper-
iments show the efficiency of our techniques under different
data distributions with orders of magnitude improvement
over na
¨
ıve materialization of possible worlds.
1 Introduction
Efficient processing of uncertain (probabilistic) data is a
crucial requirement in different domains including sensor
networks [21, 8], moving objects tracking [7, 9] and data
cleaning [12, 3]. Several probabilistic data models have
been proposed, e.g., [10, 11, 4, 15, 16, 2 0, 2], to capture data
uncertainty at different levels. According to most of these
models, tuples have membership probability, e.g., based on
data source reliability [13], or fuzzy query predicates, a s ad-
dressed in [18]. Tuple attributes could a lso contain multiple
values drawn from discrete or continuous domains [20, 6],
e.g., a set of possible customer names in a dirty database, or
an interval of possible sensor readings.
Many uncertain data models, e.g., [15, 2, 20], adopt
possible worlds semantics, where an uncertain relation is
viewed as a set of possible instances (worlds). The struc-
ture of these worlds could be governed by underlying gen-
eration rules, e.g., mutual exclusion of tuples that represent
the same real-world entity. Such rules could arise naturally
with unclean data [13, 3], or could be customized to en-
force application requirements, reflect domain semantics, or
maintain lin eage and data inter-dependencies [ 5, 20]. The
Figure 1. Uncertain Database and Possible
Worlds Space
following is a running example for an uncertain database
that we use in this paper.
Example 1 Consider a radar-controlled traffic, where car
speed readings are stored in a database. Radar units detect
speed automatically, while car identification ,e.g., by plate
number, is usually performed by a human operator. In this
database, multiple sources of errors (uncertainty) exist. For
example, radar readings can be interfered by high voltage
lines, close by cars cannot be precisely distinguished, or
human operators might make identification mistakes. Fig-
ure 1(a) is a snapshot of a radar database in the last hour.
Each reading is associated with a confidence field “conf”
indicating its membership probability. Based on radar lo-
cations, the same car cannot be detected by radars at two
different locations within 1 hour interval. This constraint is
captured by the indicated exclusiveness rules.
1.1 Uncertain Data Model
The uncertain data model we adopt is based on possi-
ble worlds semantics with two pillars. The first pillar is
membership uncerta inty, where each tuple belongs to the
database with a probability, hereafter called confidence.
The second p illar is generation rules, which are arbitrary
logical formulas that determine valid worlds. Tuples that
are correlated with no rules are called independent.
1-4244-0803-2/07/$20.00 ©2007 IEEE. 896