from uncleandata under possible worlds semantics. Methods
are also proposed to derive probabilities of uncertain items.
One of the key aspects of the Conquer project is that it permits
real time and dynamic data cleaning in such a way that clean
and consistent answers may be obtained for queries. Another
example of such a databaseis the Orion project [25],[28] which
presents query processing and indexing techniques in order
to manage uncertainty over continuou s inter vals. Such
application-specific databases are designed for their corre-
sponding domain, and are not very effective in extracting
information from “possible worlds” semantics.
A recent and interesting line of models for uncertain data
is derived from the Trio project [16], [62], [34] at Stanford
University. This work introduces the concept of Uncertainty-
Lineage Database (ULDB), which is a database with both
uncertainty and lineage. We note that the introduction of
lineage as a first-class concept within the database is a novel
concept which is useful in a variety of applications such as
query processing. The basic idea in lineage is that the model
keeps track of the sources from which the data was acquired
and also keeps track of its influence in the database. Thus,
database with lineage can link the query results (or the
results from any potential application) to the source from
which they were derived. The probabilistic influence of the
data source on the final result is an important factor which
should be accounted for in data management applications.
Thus, data (or results) which are found to be unreliable are
discarded.
Finally, a recent effort is the MayBMS project [4], [5], [6] at
Cornell University. One advantage of this system is that it
fits seamlessly into modern database systems. For example,
this approach has a powerful query language which was
built on top of PostgreSQL. Another unique feature of the
system is that it uses the concept of U-relations in order to
maximize space-efficiency. Space-efficiency is a critical
feature in uncertain database systems, since the uncertainty
results in considerable expansion of the underlying database
representation. Details of the most recent approach may be
found in [6].
2.4 Extensions to Semistructured and XML Data
Recently, uncertain data models have also been extended to
semistructured and XML data. Some of the earliest work on
probabilistic semistructured data may be found in [66]. XML
data poses numerous unique challenges. Since XML is
structured, the probabilities need to be assigned to the
structural components such as nodes and links. Furthermore,
element probabilities could occur at multiple levels and
nested probabilities within a subtree must be considered.
Furthermore, incomplete data should be handled gracefully
since one may not insist on having complete probability
distributions. In order to handle the issue that there can be
nesting of XML elements, probabilities are associated with
the attribute values of elements in an indirect way. The
approach is to modify the schema in XML so as to make any
attribute into a subelement. Thus, these new elements can be
handled by the probabilistic system. Another unique issue in
the case of XML data is that the probabilities in an ancestor-
descendent chain are related probabilistically.
In the most general case, this can lead to issues of
computational intractability. The approach in [66] is to
model some classes of dependence (e.g., mutual exclusion)
which are useful and efficient to model. The work in [66]
also designs techniques for a restricted class of queries on
the data. Another interesting approach to probabilistic XML
data construction has been discussed in [50]. In this
technique, probabilistic XML trees are constructed in order
to model the structural behavior of the data. The un-
certainty in a probabilistic tree is modeled by introducing
two kinds of nodes: 1) probability nodes, which enumerate
all possibilities, and 2) possibility nodes, which have an
associated probability. The uncertainty in the different
kinds of nodes is modeled with the use of the kind function,
which assigns node kinds. Furthermore, a prob function is
used, which assigns probabilities to nodes. The query
evaluation technique enumerates all possible worlds in a
recursive manner. The query is then applied to each such
enumerated world. Other related work on XML data
representation and modeling may be found in [79].
3UNCERTAIN DATA MANAGEMENT APPLICATIONS
In this section, we will discuss the design of a number of
data management applications with uncertain data. These
includ e applications such as query processing, Online
Analytical Processing, selectivity estimation, indexing, and
join processing. We will provide an overview of the
application models and algorithms in this section.
3.1 Query Processing of Uncertain Data
In traditional database management, queries are typically
represented as SQL expressions which are then executed on
the database according to a query plan. As we will see, the
incorporation of probabilistic information has considerable
effects on the correctness andcomputability ofthe query plan.
3.1.1 Intensional and Extensional Semantics
A given query over an uncertain database may require
computation or aggregation over a large number of
possibilities. In some cases, the query may be nested, which
greatly increases the complexity of the computation. There
are two broad semantic approaches used:
. Intensional semantics. This typically models the
uncertain database in te rms of an event model
(which defines the possible worlds), and use tree-
like structures of inferences on these event combina-
tions. This tree-like structure enumerates all the
possibilities over which the query may be evaluated
and subsequently aggregated. The tree-like enumera-
tion results in an exponential complexity in evalua-
tion time, but always yields correct results.
. Extensional semantics. Extensional semantics at-
tempts to design a plan which can approximate
these queries without having to enumerate the entire
tree of inferences. This approach treats uncertainty
as a generalized truth value attached to formulas,
and attempts to evaluate (or approximate) the
uncertainty of a given formula based on that of its
subformulas.
For the intensional case, the key is to develop a probabil-
istic relational algebra with intensional semantics which
always yields correct results. It has been shown in [32] that
certain queries have #P-complete data complexity under
intensional semantics. Note that the extensional semantics
AGGARWAL AND YU: A SURVEY OF UNCERTAIN DATA ALGORITHMS AND APPLICATIONS 611