The FP-tree Algorithm Explained: Frequent-Pattern Mining Without Candidate Generation


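Frequent-pattern mining is a core topic in data mining and knowledge discovery, studied on transaction databases, time-series databases, and other kinds of data stores. The classic Apriori algorithm finds frequent itemsets by generating and testing candidate sets. This becomes inefficient when there are many patterns or long patterns, because candidate generation is costly. The paper excerpted here introduces the frequent-pattern tree (FP-tree), an extended prefix-tree structure designed to avoid candidate generation. The FP-tree stores the information needed for frequent-pattern mining in compressed form, so mining can work on the tree instead of generating and testing candidates one by one.

In an FP-tree, each node carries one frequent item together with a count, and each path from the root records the frequent items of the transactions that follow that path. This lets mining match and prune patterns on the tree and shrinks the search space. Its main advantages are:

1. **Compression**: transactions that share frequent items share nodes and paths, so the tree needs less storage, especially when many transactions have items in common.
2. **Structure**: the tree organization makes the overlap and redundancy among patterns explicit, which helps when analyzing the data and deriving association rules.
3. **Efficient lookup**: patterns are found by following tree paths, which pays off most when the patterns are numerous or long.
4. **No candidate generation**: patterns are grown directly from the frequent items already found, so the costly candidate-generation step is avoided.

The authors, Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao, whose affiliations include the University of Illinois at Urbana-Champaign, the State University of New York at Buffalo, and Microsoft, discuss how the FP-tree is constructed and maintained and how the approach compares with existing algorithms. The work improves mining performance and offers a promising way to mine frequent patterns in large and complex data sets. Future research may optimize the FP-tree structure further and explore applications in more domains.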

MINING FREQUENT PATTERNS WITHOUT CANDIDATE GENERATION

The cost of inserting a transaction Trans into the FP-tree is O(|freq(Trans)|), where freq(Trans) is the set of frequent items in Trans. We will show that the FP-tree contains the complete information for frequent-pattern mining.
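The insertion cost stated above can be illustrated with a minimal sketch. It assumes each transaction has already been reduced to its frequent items and sorted in one fixed order, and it omits the node-links that a full FP-tree also maintains. The names `FPNode` and `insert` and the toy item lists are hypothetical, not taken from the paper.

```python
class FPNode:
    """One FP-tree node: an item label, a count, and links to children."""

    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode


def insert(root, ordered_items):
    """Insert one ordered frequent-item list into the tree.

    The loop runs once per item, so the work is proportional to
    |freq(Trans)| (assuming constant-time dictionary access).
    """
    node = root
    for item in ordered_items:
        child = node.children.get(item)
        if child is None:
            # No shared prefix beyond this point: start a new branch.
            child = FPNode(item, parent=node)
            node.children[item] = child
        child.count += 1
        node = child


root = FPNode()
# Hypothetical item lists, already filtered and ordered.
for items in (["a", "b", "c"], ["a", "b"], ["a", "d"]):
    insert(root, items)
```

All three lists share the prefix `a`, so the tree holds a single `a` node with count 3 rather than three separate copies.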

2.2. Completeness and compactness of FP-tree

There are several important properties of FP-tree that can be derived from the FP-tree construction process.

Given a transaction database DB and a support threshold ξ, let F be the set of frequent items in DB. For each transaction T, freq(T) is the set of frequent items in T, i.e., freq(T) = T ∩ F, and is called the frequent item projection of transaction T. According to the Apriori principle, the set of frequent item projections of transactions in the database is sufficient for mining the complete set of frequent patterns, because an infrequent item plays no role in frequent patterns.
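As a small illustration of the definition freq(T) = T ∩ F, the sketch below computes F and the projections for a toy database. It assumes the support threshold ξ is an absolute count. The function names and the data are hypothetical.

```python
from collections import Counter


def frequent_items(db, min_support):
    """Return F: the items contained in at least min_support transactions."""
    counts = Counter(item for transaction in db for item in set(transaction))
    return {item for item, count in counts.items() if count >= min_support}


def project(transaction, frequent):
    """freq(T) = T ∩ F: keep only the frequent items of one transaction."""
    return set(transaction) & frequent


# Hypothetical toy database, not taken from the paper.
db = [
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"a", "e"},
    {"b", "c"},
]
F = frequent_items(db, min_support=2)
projections = [project(t, F) for t in db]
```

Here `d` and `e` each occur in only one transaction, so they are dropped from every projection and cannot contribute to any frequent pattern.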

Lemma 2.1. Given a transaction database DB and a support threshold ξ, the complete set of frequent item projections of transactions in the database can be derived from DB's FP-tree.

Rationale. Based on the FP-tree construction process, for each transaction in the DB, its frequent item projection is mapped to one path in the FP-tree. For a path $a_1 \ldots a_k$ from the root to a node in the FP-tree, let $c_{a_k}$ be the count at the node labeled $a_k$ and $c'_{a_k}$ be the sum of counts of children nodes of $a_k$. Then, according to the construction of the FP-tree, the path registers frequent item projections of $c_{a_k} - c'_{a_k}$ transactions. Therefore, the FP-tree registers the complete set of frequent item projections without duplication.
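The counting argument in the rationale can be checked on a small example. The sketch below builds a tree from a few ordered projections and then recovers them by reading, at each node, the node count minus the sum of its children's counts. It assumes the projections are already ordered consistently. `FPNode`, `build_tree`, and `registered_projections` are hypothetical names.

```python
from collections import Counter


class FPNode:
    """Minimal FP-tree node without node-links."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}  # item -> FPNode


def build_tree(projections):
    """Insert each ordered projection as a path from the root."""
    root = FPNode()
    for items in projections:
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item)
            node = node.children[item]
            node.count += 1
    return root


def registered_projections(root):
    """Recover each projection and how many transactions it stands for.

    For the node at the end of a root path, (own count) minus
    (sum of children counts) transactions end exactly at that node.
    """
    recovered = Counter()
    stack = [(child, (child.item,)) for child in root.children.values()]
    while stack:
        node, path = stack.pop()
        children_total = sum(c.count for c in node.children.values())
        ending_here = node.count - children_total
        if ending_here > 0:
            recovered[path] = ending_here
        for child in node.children.values():
            stack.append((child, path + (child.item,)))
    return recovered


# Hypothetical ordered projections, not taken from the paper.
projections = [("a", "b", "c"), ("a", "b"), ("a", "b"), ("a",), ("b", "c")]
root = build_tree(projections)
recovered = registered_projections(root)
```

In this example the top-level `b` node has count 1 and a child with count 1, so no transaction ends there and the path `("b",)` is correctly absent from the result.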

Based on this lemma, after an FP-tree for DB is constructed, it contains the complete information for mining frequent patterns from the transaction database. Thereafter, only the FP-tree is needed in the remaining mining process, regardless of the number and length of the frequent patterns.

Lemma 2.2. Given a transaction database DB and a support threshold ξ, and without considering the (null) root, the size of an FP-tree is bounded by $\sum_{T \in DB} |\mathit{freq}(T)|$, and the height of the tree is bounded by $\max_{T \in DB} \{|\mathit{freq}(T)|\}$, where freq(T) is the frequent item projection of transaction T.

Rationale. Based on the FP-tree construction process, for any transaction T in DB, there exists a path in the FP-tree starting from the corresponding item prefix subtree so that the set of nodes in the path is exactly the same set of frequent items in T. The root is the only extra node that is not created by frequent-item insertion, and each node contains one node-link and one count. Thus we have the bound of the size of the tree stated in the Lemma.
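The bounds of Lemma 2.2 can be checked numerically on a toy tree. The following sketch counts the nodes below the root and measures the longest root path, then compares both with the sums and maxima over the projections. All names and data are hypothetical, and node-links are omitted.

```python
class FPNode:
    """Minimal FP-tree node without node-links."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}  # item -> FPNode


def build_tree(projections):
    """Insert each ordered projection as a path from the root."""
    root = FPNode()
    for items in projections:
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item)
            node = node.children[item]
            node.count += 1
    return root


def size(node):
    """Number of nodes below `node`; the null root itself is not counted."""
    return sum(1 + size(child) for child in node.children.values())


def height(node):
    """Number of nodes on the longest path below `node`."""
    return max((1 + height(child) for child in node.children.values()), default=0)


# Hypothetical ordered projections, not taken from the paper.
projections = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["b", "c"]]
root = build_tree(projections)

size_bound = sum(len(p) for p in projections)    # sum of |freq(T)|
height_bound = max(len(p) for p in projections)  # max of |freq(T)|
tree_size = size(root)
tree_height = height(root)
```

Prefix sharing is what keeps the tree below the size bound here: the four projections contain 9 items in total, but the tree needs only 6 nodes.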

The height of any p-prefix subtree is the maximum number of frequent items in any transaction with p appearing at the head of its frequent item list. Therefore, the height of
