A Survey of Uncertain Data Algorithms and Applications: Models, Management, and Mining Challenges


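As information technology advances and indirect data collection methods become widespread, processing and managing uncertain data has become increasingly important. Uncertain data is information that carries uncertainty or imprecision in the real world. The uncertainty may stem from equipment limitations (such as noise in sensor networks), from incomplete data sets produced by privacy protection (for example, demographic data where only partial aggregate information is available and each aggregated record is represented by a probability distribution), or from attributes constructed by statistical methods (such as the uncertainty introduced by a predictive model). Such data is difficult to handle because the probabilistic information must be processed and interpreted together with the data itself, which poses new challenges for database management and mining.

In the article “A Survey of Uncertain Data Algorithms and Applications”, the authors Charu Aggarwal and Philip S. Yu, an IEEE Senior Member and an IEEE Fellow, examine the mining and management of uncertain data in depth. They first describe how uncertain data arises and where it is applied, in fields such as the Internet of Things, business intelligence, and market analysis, where the accuracy, precision, and consistency of data are all challenged.

The article focuses on several main models for uncertain data, including probabilistic models (such as Bayesian networks and Markov networks), fuzzy logic models (such as Zadeh's fuzzy sets), and interval models (such as interval numbers and fuzzy intervals). These models provide the mathematical basis for representing uncertain data, so that the uncertainty in the data can be understood and processed.

On the database management side, researchers have studied join operations, query processing, selectivity estimation, OLAP (Online Analytical Processing) queries, and index design for uncertain data. For example, fuzzy queries call for new query algorithms that handle uncertain keywords, and sorting and indexing uncertain data must account for the underlying probability distributions in order to improve query efficiency.

On the mining side, the article covers traditional problems such as frequent pattern mining, outlier detection, classification, and clustering. Frequent pattern mining may need probability-based computation of pattern support; outlier detection must identify outliers while accounting for the uncertainty in the data; and in classification and clustering, uncertainty can cause traditional hard-boundary partitioning methods to fail, so algorithms that adapt to fuzzy boundaries are needed.

The paper not only describes the state of uncertain data processing but also proposes directions for future research, emphasizing how to design more efficient and accurate data management methods and mining techniques in the presence of uncertainty and imprecision. It is a valuable reference for data scientists, database administrators, and information technology practitioners who need to understand and handle the complex challenges of modern data environments.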

from unclean data under possible worlds semantics. Methods are also proposed to derive probabilities of uncertain items. One of the key aspects of the Conquer project is that it permits real-time and dynamic data cleaning in such a way that clean and consistent answers may be obtained for queries. Another example of such a database is the Orion project [25], [28], which presents query processing and indexing techniques in order to manage uncertainty over continuous intervals. Such application-specific databases are designed for their corresponding domain, and are not very effective in extracting information from “possible worlds” semantics.
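To make the idea of uncertainty over a continuous interval concrete, here is a minimal sketch of a probabilistic range query. It assumes each uncertain value is uniformly distributed over its interval, which is a simplification chosen for this illustration and not something the text attributes to Orion. The names `UncertainInterval` and `range_probability` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UncertainInterval:
    """An uncertain value known only to lie in [low, high], with high > low."""
    low: float
    high: float


def range_probability(value, q_low, q_high):
    """Probability that the value falls in [q_low, q_high].

    Assumption: the value is uniformly distributed over its interval, so the
    probability is the overlapping length divided by the interval length.
    """
    overlap = min(value.high, q_high) - max(value.low, q_low)
    if overlap <= 0:
        return 0.0
    return overlap / (value.high - value.low)


# A sensor reading known to lie between 20 and 30; query range is [25, 40].
reading = UncertainInterval(20.0, 30.0)
p = range_probability(reading, 25.0, 40.0)  # half of the interval overlaps
```

A query processor built on this idea would return each candidate together with its probability, or keep only candidates whose probability exceeds a user-chosen threshold.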

A recent and interesting line of models for uncertain data is derived from the Trio project [16], [62], [34] at Stanford University. This work introduces the concept of the Uncertainty-Lineage Database (ULDB), which is a database with both uncertainty and lineage. We note that the introduction of lineage as a first-class concept within the database is a novel idea which is useful in a variety of applications such as query processing. The basic idea in lineage is that the model keeps track of the sources from which the data was acquired and also keeps track of its influence in the database. Thus, a database with lineage can link the query results (or the results from any potential application) to the source from which they were derived. The probabilistic influence of the data source on the final result is an important factor which should be accounted for in data management applications. Thus, data (or results) which are found to be unreliable are discarded.
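The lineage idea can be sketched in a few lines. The example below is an illustration only and not Trio's actual data model: each record carries the set of sources it was derived from, and its confidence is taken to be the product of the source confidences, which assumes independent sources combined conjunctively. All names (`Record`, `derive`, `SOURCE_CONFIDENCE`) and all values are hypothetical.

```python
from dataclasses import dataclass
from math import prod

# Hypothetical confidence assigned to each data source (illustrative values).
SOURCE_CONFIDENCE = {"sensor_a": 0.9, "sensor_b": 0.5, "survey_c": 0.8}


@dataclass(frozen=True)
class Record:
    value: str
    lineage: frozenset  # ids of the sources this record was derived from


def base(value, source):
    """A record acquired directly from a single source."""
    return Record(value, frozenset([source]))


def derive(value, *inputs):
    """A derived record inherits the union of its inputs' lineage."""
    return Record(value, frozenset().union(*(r.lineage for r in inputs)))


def confidence(record):
    """Assumption: independent sources, all of which must be correct."""
    return prod(SOURCE_CONFIDENCE[s] for s in record.lineage)


def reliable(records, threshold):
    """Discard results whose lineage makes them too unreliable."""
    return [r for r in records if confidence(r) >= threshold]


a = base("temp=21", "sensor_a")
b = base("temp=35", "sensor_b")
c = base("room=lab", "survey_c")
results = [derive("lab temp=21", a, c), derive("lab temp=35", b, c)]
kept = reliable(results, threshold=0.6)  # only the sensor_a-based result survives
```

Because every result keeps its lineage, a result can be traced back to its sources, and lowering one source's confidence changes the confidence of every result derived from it.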

Finally, a recent effort is the MayBMS project [4], [5], [6] at Cornell University. One advantage of this system is that it fits seamlessly into modern database systems. For example, this approach has a powerful query language which was built on top of PostgreSQL. Another unique feature of the system is that it uses the concept of U-relations in order to maximize space-efficiency. Space-efficiency is a critical feature in uncertain database systems, since the uncertainty results in considerable expansion of the underlying database representation. Details of the most recent approach may be found in [6].

2.4 Extensions to Semistructured and XML Data

Recently, uncertain data models have also been extended to semistructured and XML data. Some of the earliest work on probabilistic semistructured data may be found in [66]. XML data poses numerous unique challenges. Since XML is structured, the probabilities need to be assigned to the structural components such as nodes and links. Furthermore, element probabilities could occur at multiple levels, and nested probabilities within a subtree must be considered. In addition, incomplete data should be handled gracefully, since one may not insist on having complete probability distributions. In order to handle the issue that there can be nesting of XML elements, probabilities are associated with the attribute values of elements in an indirect way. The approach is to modify the schema in XML so as to make any attribute into a subelement. Thus, these new elements can be handled by the probabilistic system. Another unique issue in the case of XML data is that the probabilities in an ancestor-descendant chain are related probabilistically.
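The schema modification described above, turning every attribute into a subelement, can be sketched with Python's standard `xml.etree.ElementTree` module. This is only an illustration of the rewriting step and not the procedure from [66]. The function name `attributes_to_subelements` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def attributes_to_subelements(elem):
    """Recursively replace each attribute with a child element of the same name."""
    # Visit the existing children first so that the subelements created below
    # are not visited again.
    for child in list(elem):
        attributes_to_subelements(child)
    for name, value in list(elem.attrib.items()):
        sub = ET.SubElement(elem, name)
        sub.text = value
    elem.attrib.clear()
    return elem


doc = ET.fromstring(
    '<person name="Ann"><phone type="home" number="555"/></person>'
)
attributes_to_subelements(doc)
rewritten = ET.tostring(doc, encoding="unicode")
```

After the rewrite, `name`, `type`, and `number` are ordinary elements, so a system that attaches probabilities to elements can treat them like any other node.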

In the most general case, this can lead to issues of computational intractability. The approach in [66] is to model some classes of dependence (e.g., mutual exclusion) which are useful and efficient to model. The work in [66] also designs techniques for a restricted class of queries on the data. Another interesting approach to probabilistic XML data construction has been discussed in [50]. In this technique, probabilistic XML trees are constructed in order to model the structural behavior of the data. The uncertainty in a probabilistic tree is modeled by introducing two kinds of nodes: 1) probability nodes, which enumerate all possibilities, and 2) possibility nodes, which have an associated probability. The uncertainty in the different kinds of nodes is modeled with the use of the kind function, which assigns node kinds. Furthermore, a prob function is used, which assigns probabilities to nodes. The query evaluation technique enumerates all possible worlds in a recursive manner. The query is then applied to each such enumerated world. Other related work on XML data representation and modeling may be found in [79].
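The recursive enumeration of possible worlds can be sketched as follows. This is a simplified reading of the tree model described above and not the implementation from [50]. It assumes that the possibilities under one probability node are mutually exclusive and that different probability nodes are independent. Here the `kind` and `prob` fields of a hypothetical `Node` class stand in for the kind and prob functions.

```python
from dataclasses import dataclass, field
from itertools import product
from math import prod


@dataclass
class Node:
    kind: str          # "xml", "prob" (probability node) or "poss" (possibility node)
    label: str = ""    # tag or content, used by "xml" nodes only
    prob: float = 1.0  # used by "poss" nodes only
    children: list = field(default_factory=list)


def combine(child_worlds):
    """Cartesian product of the worlds of independent sibling subtrees."""
    out = []
    for combo in product(*child_worlds):
        forest = tuple(tree for sub_forest, _ in combo for tree in sub_forest)
        out.append((forest, prod(p for _, p in combo)))
    return out


def worlds(node):
    """Return a list of (forest, probability) pairs; a forest is a tuple of certain trees."""
    if node.kind == "prob":
        # Exactly one possibility is chosen: weight its worlds by its probability.
        return [(f, poss.prob * p) for poss in node.children for f, p in worlds(poss)]
    below = combine([worlds(c) for c in node.children])
    if node.kind == "poss":
        return below
    # Ordinary XML node: wrap the chosen children under this node's label.
    return [(((node.label, f),), p) for f, p in below]


def contains(forest, label):
    return any(l == label or contains(kids, label) for l, kids in forest)


def query_probability(root, predicate):
    """Apply the query to every enumerated world and add up the matching weights."""
    return sum(p for forest, p in worlds(root) if predicate(forest))


def xml(label, *kids): return Node("xml", label, children=list(kids))
def poss(p, *kids): return Node("poss", prob=p, children=list(kids))
def choice(*kids): return Node("prob", children=list(kids))


# A person whose phone number is 1111 with probability 0.7 and 2222 otherwise.
doc = xml("person",
          choice(poss(1.0, xml("name:Ann"))),
          choice(poss(0.7, xml("phone:1111")), poss(0.3, xml("phone:2222"))))
answer = query_probability(doc, lambda f: contains(f, "phone:1111"))
```

The number of enumerated worlds is the product of the number of choices at each probability node, which is why this style of evaluation becomes expensive on large documents.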

3 UNCERTAIN DATA MANAGEMENT APPLICATIONS

In this section, we will discuss the design of a number of data management applications with uncertain data. These include applications such as query processing, Online Analytical Processing, selectivity estimation, indexing, and join processing. We will provide an overview of the application models and algorithms in this section.

3.1 Query Processing of Uncertain Data

In traditional database management, queries are typically represented as SQL expressions which are then executed on the database according to a query plan. As we will see, the incorporation of probabilistic information has considerable effects on the correctness and computability of the query plan.

3.1.1 Intensional and Extensional Semantics

A given query over an uncertain database may require computation or aggregation over a large number of possibilities. In some cases, the query may be nested, which greatly increases the complexity of the computation. There are two broad semantic approaches used:

• Intensional semantics. This typically models the uncertain database in terms of an event model (which defines the possible worlds), and uses tree-like structures of inferences on these event combinations. This tree-like structure enumerates all the possibilities over which the query may be evaluated and subsequently aggregated. The tree-like enumeration results in an exponential complexity in evaluation time, but always yields correct results.

• Extensional semantics. Extensional semantics attempts to design a plan which can approximate these queries without having to enumerate the entire tree of inferences. This approach treats uncertainty as a generalized truth value attached to formulas, and attempts to evaluate (or approximate) the uncertainty of a given formula based on that of its subformulas.
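The difference between the two semantics can be seen on a small formula. The sketch below is an illustration and not an algorithm from the survey. It assumes independent base events with known probabilities and represents formulas as nested tuples. The function `intensional` enumerates every possible world. The function `extensional` combines subformula probabilities as if subformulas were independent, which is one common simplification assumed here. All names are hypothetical.

```python
from itertools import product

# A formula is ("var", name), ("and", f, g) or ("or", f, g).


def holds(formula, world):
    op = formula[0]
    if op == "var":
        return world[formula[1]]
    left, right = holds(formula[1], world), holds(formula[2], world)
    return (left and right) if op == "and" else (left or right)


def intensional(formula, probs):
    """Exact probability: enumerate all 2^n worlds over the base events."""
    names = sorted(probs)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        world = dict(zip(names, bits))
        weight = 1.0
        for n in names:
            weight *= probs[n] if world[n] else 1.0 - probs[n]
        if holds(formula, world):
            total += weight
    return total


def extensional(formula, probs):
    """Bottom-up evaluation that treats subformulas as independent."""
    op = formula[0]
    if op == "var":
        return probs[formula[1]]
    a, b = extensional(formula[1], probs), extensional(formula[2], probs)
    return a * b if op == "and" else 1.0 - (1.0 - a) * (1.0 - b)


probs = {"e1": 0.5, "e2": 0.5, "e3": 0.5}
# (e1 AND e2) OR (e1 AND e3): both branches share the event e1.
f = ("or", ("and", ("var", "e1"), ("var", "e2")),
           ("and", ("var", "e1"), ("var", "e3")))

exact = intensional(f, probs)   # 0.375
approx = extensional(f, probs)  # 0.4375
```

The extensional value is computed in time linear in the size of the formula, but it is wrong here because the two branches share `e1` and are therefore not independent. The intensional value is correct at the cost of enumerating all eight worlds.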

For the intensional case, the key is to develop a probabilistic relational algebra with intensional semantics which always yields correct results. It has been shown in [32] that certain queries have #P-complete data complexity under intensional semantics. Note that the extensional semantics

AGGARWAL AND YU: A SURVEY OF UNCERTAIN DATA ALGORITHMS AND APPLICATIONS 611
