Efficient Data Mining: Comparing and Optimizing Algorithms for Functional Dependency Discovery

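Discovering functional dependencies from data is an important research task in IT, with applications in knowledge discovery, machine learning, and data quality assessment. The paper examines how to identify and mine functional dependencies (FDs) efficiently from existing databases. A functional dependency is a basic concept in database design: it describes a constraint between the attributes of a relation schema, in which the value of one attribute or attribute set is fully determined by another attribute or attribute set. The literature proposes many algorithms for this problem, including statistical methods, pattern mining techniques, and hash-based data structures.

The authors, Jixue Liu, Feiyue Ye, Jiuyong Li, and Junhu Wang, review and compare these algorithms to expose their respective strengths and differences. They pay particular attention to time efficiency and space efficiency, both of which are critical when processing large data sets. They focus on a new hash-based algorithm that is simple and performs well in both running time and storage requirements. In a performance comparison with three recently published algorithms, the hash-based algorithm performs better overall. The authors analyze the reasons for this advantage, which may be that hashing reduces computation and memory use on large data sets.

The article also records its received, revised, and accepted dates and its online publication date, which shows that the work went through a formal review process. Beyond proposing a new algorithm, the paper contributes an empirical evaluation of how different methods perform in practice, which makes it a useful reference for database designers, data scientists, and machine learning engineers.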

Definition 2.3. Minimal FD, cover, and minimal cover

Let Σ be a set of FDs. An FD X → A in Σ is minimal (minimal FD or irreducible) if there does not exist another FD Y → A ∈ Σ such that Y ⊂ X.

A subset Σ′ of Σ is a cover of Σ if all FDs in Σ are either in Σ′ explicitly or are implied by the FDs in Σ′. A cover is minimal (minimal cover) if all its FDs are minimal and no FD in Σ′ is implied by other FDs of Σ′ (we assume that FDs have single attributes on the rhs). □

We note that all FDs with single attributes on the lhs are minimal. Σ may have multiple covers and multiple minimal covers. All FDs of a cover are implied by the FDs of another cover, and this is true for minimal covers too. The concept of cover in this paper is called redundant cover in [16].

The definition implies that an FD X → A is reducible if there is a Y ⊂ X such that Y → A holds, or such that Y → (X \ Y) holds within the lhs of the FD. In the latter case, Y → (X \ Y) reduces X → A to Y → A.

Assume Σ = {AC → D, A → B, B → A, BC → D, ABC → D} is the set of all FDs holding on r. Then the FD ABC → D is not minimal because of AC → D in Σ. The set Σ′ = {AC → D, A → B, B → A, BC → D} is a cover because ABC → D is implied by AC → D. The sets Σ′₁ = {AC → D, A → B, B → A} and Σ′₂ = {BC → D, A → B, B → A} are two minimal covers of Σ because each can derive all FDs of Σ. Σ′₁ and Σ′₂ are equivalent because all the FDs of one can be derived from the FDs of the other.
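The example above can be checked mechanically. The following is a minimal sketch, assuming FDs are stored as (lhs set, single rhs attribute) pairs and that implication is tested with the standard attribute-closure computation; the helper names (`fd`, `closure`, `implies`, `is_cover`, `is_minimal_fd`, `is_minimal_cover`) are illustrative and do not come from the paper.

```python
from typing import FrozenSet, Iterable, Set, Tuple

FD = Tuple[FrozenSet[str], str]  # (lhs attribute set, single rhs attribute)


def fd(lhs: str, rhs: str) -> FD:
    """Build an FD from compact strings, e.g. fd("AC", "D") for AC -> D."""
    return (frozenset(lhs), rhs)


def closure(attrs: Iterable[str], fds: Set[FD]) -> FrozenSet[str]:
    """Return every attribute determined by `attrs` under the FDs in `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and rhs not in result:
                result.add(rhs)
                changed = True
    return frozenset(result)


def implies(fds: Set[FD], target: FD) -> bool:
    """True if `target` can be derived from `fds`."""
    lhs, rhs = target
    return rhs in closure(lhs, fds)


def is_cover(candidate: Set[FD], sigma: Set[FD]) -> bool:
    """A cover is a subset of sigma that implies every FD of sigma."""
    return candidate <= sigma and all(implies(candidate, f) for f in sigma)


def is_minimal_fd(target: FD, sigma: Set[FD]) -> bool:
    """Minimal: no FD in sigma has the same rhs and a proper subset as lhs."""
    lhs, rhs = target
    return not any(r == rhs and l < lhs for l, r in sigma)


def is_minimal_cover(candidate: Set[FD], sigma: Set[FD]) -> bool:
    """Cover whose FDs are all minimal and none is implied by the others."""
    return (
        is_cover(candidate, sigma)
        and all(is_minimal_fd(f, sigma) for f in candidate)
        and not any(implies(candidate - {f}, f) for f in candidate)
    )


# The FD sets of the example.
SIGMA = {fd("AC", "D"), fd("A", "B"), fd("B", "A"), fd("BC", "D"), fd("ABC", "D")}
COVER = {fd("AC", "D"), fd("A", "B"), fd("B", "A"), fd("BC", "D")}
COVER_1 = {fd("AC", "D"), fd("A", "B"), fd("B", "A")}
COVER_2 = {fd("BC", "D"), fd("A", "B"), fd("B", "A")}
```

Under these assumptions, `is_cover(COVER, SIGMA)` holds but `is_minimal_cover(COVER, SIGMA)` does not, because BC → D follows from B → A and AC → D. Both three-FD sets pass `is_minimal_cover`, and each implies every FD of the other.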

In the literature, there are two approaches for discovering FDs from a relation [23]: the top-down approach and the bottom-up approach. We first review the major results of the top-down approach.

The top-down approach, employed by TANE [8], FD_Mine [26], and FUN [22], generates candidate FDs (canFDs), that is, the FDs that are syntactically possible with regard to the attributes of the relation, and then checks the canFDs against the relation for satisfaction. The canFDs satisfied by the relation are the FDs discovered.
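The generate-and-test idea can be sketched as follows. This is a naive illustration, not TANE, FD_Mine, or FUN: it enumerates left-hand sides by increasing size, tests each canFD directly against the rows (tuples that agree on the lhs must agree on the rhs), and skips a canFD when a smaller lhs already determines the same rhs. It starts from single-attribute left-hand sides, so it ignores constant columns. The relation `ROWS` and the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]
FoundFD = Tuple[Tuple[str, ...], str]  # (lhs attributes, rhs attribute)


def holds(rows: Sequence[Row], lhs: Sequence[str], rhs: str) -> bool:
    """Check lhs -> rhs: equal lhs values must come with equal rhs values."""
    seen: Dict[tuple, object] = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def discover_minimal_fds(rows: Sequence[Row], attributes: Sequence[str]) -> List[FoundFD]:
    """Enumerate canFDs by lhs size and keep the satisfied, minimal ones."""
    found: List[FoundFD] = []
    for size in range(1, len(attributes)):
        for lhs in combinations(attributes, size):
            for rhs in attributes:
                if rhs in lhs:
                    continue  # trivial FD
                if any(r == rhs and set(l) < set(lhs) for l, r in found):
                    continue  # a smaller lhs already determines rhs
                if holds(rows, lhs, rhs):
                    found.append((lhs, rhs))
    return found


# Hypothetical relation, made up for illustration only.
ROWS: List[Row] = [
    {"A": 1, "B": "x", "C": "p", "D": 10},
    {"A": 1, "B": "x", "C": "q", "D": 20},
    {"A": 2, "B": "y", "C": "p", "D": 30},
    {"A": 2, "B": "y", "C": "q", "D": 30},
]

FDS = discover_minimal_fds(ROWS, ["A", "B", "C", "D"])
```

The number of canFDs grows exponentially with the number of attributes, which is why the published algorithms invest in pruning and in cheaper satisfaction tests.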

To generate candidate FDs, an attribute lattice [9] is used. An attribute lattice is a directed graph with the root node, represented by ϕ, at Level-0 containing no attribute. At Level-1, each node contains one attribute. At Level-2, each node contains two distinct attributes. Let n_ij represent the j-th node at Level-i and also the attribute set of the node. A directed edge is drawn from the j-th node at Level-i to the k-th node at Level-(i+1) if n_ij ⊂ n_(i+1)k. Each edge represents the canFD n_ij → (n_(i+1)k − n_ij). n_ij is the parent and n_(i+1)k is the child. The canFD is said to be from the parent and to the child. Fig. 1 shows a lattice of R = {A, B, C, D} where downward arrows are omitted from the edges; the edge between the node AB at Level-2 and the node ABC at Level-3 represents the canFD AB → C. We use L_i to denote all the nodes at Level-i.

An attribute lattice has 2^|R| nodes (the number of nodes equals the sum of the numbers in the |R|-th row of Pascal's triangle) and |R|·2^(|R|−1) edges (this can be obtained by using the symmetric property of the lattice). See [17,18] for details.
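The node and edge counts can be confirmed on a small schema. The sketch below builds the lattice level by level for R = {A, B, C, D} and reads the canFD off each parent-child edge; `build_lattice` and `can_fd` are hypothetical helper names chosen for this illustration.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence, Tuple

Node = FrozenSet[str]


def build_lattice(attributes: Sequence[str]) -> Tuple[List[List[Node]], List[Tuple[Node, Node]]]:
    """Return (levels, edges); Level-i holds every attribute set of size i."""
    size = len(attributes)
    levels = [
        [frozenset(combo) for combo in combinations(attributes, i)]
        for i in range(size + 1)
    ]
    edges = [
        (parent, child)
        for i in range(size)
        for parent in levels[i]
        for child in levels[i + 1]
        if parent < child  # the child adds exactly one attribute to the parent
    ]
    return levels, edges


def can_fd(parent: Node, child: Node) -> str:
    """Render the canFD of an edge: parent -> (child - parent)."""
    (rhs,) = child - parent
    lhs = "".join(sorted(parent)) or "ϕ"
    return f"{lhs}->{rhs}"


LEVELS, EDGES = build_lattice(["A", "B", "C", "D"])
NODE_COUNT = sum(len(level) for level in LEVELS)  # expected 2**4
EDGE_COUNT = len(EDGES)                           # expected 4 * 2**3
```

The level sizes are the binomial coefficients 1, 4, 6, 4, 1, which is the Pascal's triangle row mentioned above. Each node of size i has |R| − i children, which gives the edge total.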

The satisfaction of a canFD can be tested by following Definition 2.1. However, the method used in the literature relies on attribute partitions [2], defined below.

Definition 2.4. Attribute partition

Let r be a relation over a set of attributes R. r contains the special attribute tid (for tuple identifier), which uniquely identifies the tuples in r. An attribute partition (partition for short) of r with regard to X ⊆ R is the set π_X = {c_1, ⋯, c_n} where

▪ each c_i ∈ π_X is a set of the tids of all the tuples having equal X value and is called an equivalence class,
▪ n is the number of distinct values in the projection r[X], called the class count, and
▪ c_1 ∪ ⋯ ∪ c_n = r[tid], and ∀ c_i, c_j ∈ [c_1, ⋯, c_n] (if i ≠ j, c_i ∩ c_j = ϕ).

□
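A partition can be computed by grouping tids on the X value. The sketch below also applies a test that is commonly used with partitions (for example in TANE) and is not stated in this excerpt: X → A holds exactly when π_X and π_(X∪{A}) have the same class count. The relation `ROWS` is hypothetical and is not Table 1 of the paper; the function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Sequence

Row = Dict[str, object]


def partition(rows: Sequence[Row], attrs: Sequence[str]) -> List[FrozenSet[str]]:
    """Group tids into equivalence classes by their value on `attrs`."""
    classes: Dict[tuple, set] = defaultdict(set)
    for row in rows:
        classes[tuple(row[a] for a in attrs)].add(row["tid"])
    return [frozenset(c) for c in classes.values()]


def fd_holds(rows: Sequence[Row], lhs: Sequence[str], rhs: str) -> bool:
    """Assumed criterion: lhs -> rhs holds iff adding rhs splits no class."""
    return len(partition(rows, lhs)) == len(partition(rows, list(lhs) + [rhs]))


# Hypothetical relation with a tid column, made up for illustration only.
ROWS: List[Row] = [
    {"tid": "t1", "A": 1, "B": "x", "C": "p", "D": 10},
    {"tid": "t2", "A": 1, "B": "x", "C": "q", "D": 20},
    {"tid": "t3", "A": 2, "B": "y", "C": "p", "D": 30},
    {"tid": "t4", "A": 2, "B": "y", "C": "q", "D": 30},
]

PI_A = partition(ROWS, ["A"])  # two classes: {t1, t2} and {t3, t4}
PI_D = partition(ROWS, ["D"])  # three classes: {t1}, {t2}, {t3, t4}
```

Comparing class counts avoids comparing tuples pairwise: the test only needs the two partitions, which is part of why partition-based checking is attractive for large relations.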

In Table 1, π = {{t}, {t}}, π = {{t}, {t}}, π = {{t}, {t}}, π = {{t}, {t}}.

L-0: ϕ
L-1: A B C D
L-2: AB AC AD BC BD CD
L-3: ABC ABD ACD BCD
L-4: ABCD

Fig. 1. An attribute lattice.

148 J. Liu et al. / Data & Knowledge Engineering 86 (2013) 146–159
