in such a way that each machine executes an independent group of mining tasks. Such partitioning eliminates computational dependencies and communications among machines.
The PFP algorithm consists of five steps, three of which each set up a separate MapReduce job. The details of the five steps are as follows. We can see that PFP is able to achieve excellent parallelism because the threads in each job are independent.
Step 1. Sharding: dividing the input database D into successive parts and storing the parts on P different computers. Such division and distribution of data is called sharding, and each part is called a shard. This step is usually handled by the partition process of the MapReduce framework.
Step 2. Parallel counting: running a MapReduce pass to count the support values of all items that appear in D. Each mapper takes one shard of D as input, and for every occurrence of an item i in the shard, the mapper outputs the key/value pair (i, 1). The reducer collects the key/value pairs with the same key k and simply adds the values of those pairs together to obtain a sum s. It then outputs the key/value pair (k, s). This step implicitly discovers the items' vocabulary V, which is usually unknown for a huge D. The result is stored in the F-list.
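To make the counting pass concrete, the following Python sketch emulates it on a single machine; the names count_map, count_reduce, and shards are ours, not part of PFP, and a real deployment would let the MapReduce framework handle the shuffling of the emitted pairs.

    from collections import defaultdict

    def count_map(shard):
        # Mapper: emit (item, 1) for every occurrence of an item in the shard.
        for transaction in shard:
            for item in transaction:
                yield item, 1

    def count_reduce(pairs):
        # Reducer: sum the values of pairs sharing the same key (item).
        support = defaultdict(int)
        for item, value in pairs:
            support[item] += value
        return support  # corresponds to the F-list before sorting by support

    # Toy database split into two shards.
    shards = [[["a", "b", "c"], ["a", "c"]], [["b", "c"], ["a"]]]
    pairs = [p for shard in shards for p in count_map(shard)]
    print(dict(count_reduce(pairs)))  # {'a': 3, 'b': 2, 'c': 3}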
Step 3. Grouping items: dividing all the |I| items on the F-list into Q groups. The list of groups is called the group list (G-list), where each group is given a unique group-ID (gid). As both the F-list and the G-list are relatively small and the time complexity is O(|I|), this step can be completed on a single computer in a few seconds.
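The text does not prescribe a particular partitioning rule, so the sketch below (with our own naming) simply cuts the F-list into Q contiguous blocks of roughly equal size and assigns each block a gid.

    def build_g_list(f_list, q):
        # Split the F-list (items sorted by descending support) into q groups
        # of roughly equal size; each group receives a unique gid.
        size = -(-len(f_list) // q)  # ceiling division
        return {gid: f_list[gid * size:(gid + 1) * size] for gid in range(q)}

    f_list = ["c", "a", "b", "d", "e"]   # items in descending support order
    g_list = build_g_list(f_list, 2)
    print(g_list)                        # {0: ['c', 'a', 'b'], 1: ['d', 'e']}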
Step 4. Parallel FP-Growth: this is the key step of PFP. It takes one MapReduce pass, where the mappers and the reducers perform different functions as follows. The details of each function can be seen in Fig.3:
mapper: generating group-dependent transactions;
reducer: FP-Growth on group-dependent shards.
Step 5. Aggregating: aggregating the results generated in Step 4 into the final result. The algorithms of the mappers and the reducers are described in detail in Fig.4.
3.3 CanTree: Canonical-Order Tree
The CanTree (canonical-order tree) designed by Leung et al.[14,16] is a novel tree structure that captures the content of the transaction database and orders tree nodes according to some canonical order. It does not require any adjustment, merging, or splitting of tree nodes during its maintenance. When incremental updating happens, neither the rescanning of the entire updated database nor the reconstruction of a new tree is needed.
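To make the idea concrete, the following sketch (ours, not taken from [14,16]) inserts transactions into a CanTree-like structure using lexicographic order as the canonical order; because the order is fixed in advance, an incremental batch can be inserted into the existing tree without rescanning old data or restructuring existing nodes.

    class CanTreeNode:
        def __init__(self, item):
            self.item = item
            self.count = 0
            self.children = {}   # item -> CanTreeNode

    def insert(root, transaction):
        # Items are arranged in a fixed canonical order (here: lexicographic),
        # so shared prefixes merge and counts grow along the existing path.
        node = root
        for item in sorted(transaction):
            node = node.children.setdefault(item, CanTreeNode(item))
            node.count += 1

    root = CanTreeNode(None)
    for t in [["b", "a", "c"], ["a", "c"]]:   # original database
        insert(root, t)
    for t in [["c", "b"]]:                    # incremental update: just insert
        insert(root, t)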
Procedure: Mapper(key, value = T_i)
    Load G-list;
    Generate hash table H from G-list;
    a[] ← Split(T_i);
    for j = |T_i| − 1 to 0 do
        HashNum ← getHashNum(H, a[j]);
        if HashNum ≠ null then
            Delete all pairs whose hash value is HashNum in H;
            Call Output(<HashNum, a[0] + a[1] + ... + a[j]>);
        end
    end

Procedure: Reducer(key = gid, value = D_gid)
    Load G-list;
    nowGroup ← G-list_gid;
    LocalFPTree ← clear;
    foreach T_i in D_gid do
        Call insert-build-fp-tree(LocalFPTree, T_i);
    end
    foreach a_i in nowGroup do
        Define an empty max heap HP with size K;
        Call TopKFPGrowth(LocalFPTree, a_i, HP);
        foreach S_i in HP do
            Call Output(<null, S_i + supp(S_i)>);
        end
    end

Fig.3. Functions performed by mappers and reducers in parallel FP-Growth[7].
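As a rough single-machine illustration of the mapper in Fig.3, the Python sketch below regenerates group-dependent transactions from one transaction; the names map_transaction, item_to_gid, and emit are ours, and the items of each transaction are assumed to be already sorted in F-list order, as PFP requires.

    def map_transaction(transaction, item_to_gid, emit):
        # For each group appearing in the transaction, emit the longest prefix
        # of the transaction that ends at the last item of that group.
        remaining = dict(item_to_gid)          # working copy of the hash table H
        for j in range(len(transaction) - 1, -1, -1):
            gid = remaining.get(transaction[j])
            if gid is not None:
                # Drop every item mapped to this gid so the group is emitted once.
                remaining = {k: v for k, v in remaining.items() if v != gid}
                emit(gid, transaction[:j + 1])

    pairs = []
    item_to_gid = {"c": 0, "a": 0, "b": 1, "d": 1}
    map_transaction(["c", "a", "b", "d"], item_to_gid,
                    lambda g, t: pairs.append((g, t)))
    print(pairs)  # [(1, ['c', 'a', 'b', 'd']), (0, ['c', 'a'])]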
Procedure: Mapper(key, value = S + supp(S))
    foreach item a_i in S do
        Call Output(<a_i, S + supp(S)>);
    end

Procedure: Reducer(key = a_i, value = set(S + supp(S)))
    Define an empty max heap HP with size K;
    foreach itemset S in set(S + supp(S)) do
        if |HP| < K then
            Insert S + supp(S) into HP;
        else
            if supp(HP[0].S) < supp(S) then
                Delete top element in HP;
                Insert S + supp(S) into HP;
            end
        end
    end
    Call Output(<null, a_i + HP>);

Fig.4. Functions performed by mappers and reducers in aggregating[7].
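A minimal Python rendering of the reducer in Fig.4 (our own sketch) keeps only the K most supported patterns per item; we use a min-heap keyed on support, the usual way to retain a top-K set, rather than literally mirroring the max heap of the pseudocode.

    import heapq

    def aggregate_reduce(item, patterns, k):
        # Keep the k patterns with the highest support for `item`.
        # `patterns` is an iterable of (itemset, support) pairs.
        heap = []                                  # min-heap ordered by support
        for itemset, supp in patterns:
            if len(heap) < k:
                heapq.heappush(heap, (supp, itemset))
            elif heap[0][0] < supp:                # smallest retained support
                heapq.heapreplace(heap, (supp, itemset))
        return item, sorted(heap, reverse=True)

    print(aggregate_reduce("a", [(("a",), 3), (("a", "c"), 2), (("a", "b"), 1)], 2))
    # ('a', [(3, ('a',)), (2, ('a', 'c'))])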
In CanTree, items are arranged according to some canonical order, which can be determined by the user prior to the mining process or at running time during