in such a way that each machine executes an independent group of mining tasks. Such partitioning eliminates computational dependencies and communications among machines.
The PFP algorithm consists of five steps, three of which each set up a separate MapReduce job. The details of the five steps are as follows. We can see that PFP is able to achieve excellent parallelism because the threads in each job are independent.
Step 1. Sharding: dividing the input database D into successive parts and storing the parts on P different computers. Such division and distribution of data is called sharding, and each part is called a shard. This step is usually handled by the partition process of the MapReduce framework.
Step 2. Parallel counting: running a MapReduce pass to count the support values of all items that appear in D. Each mapper takes one shard of D as input, and for every occurrence of an item i in the shard, the mapper outputs the key/value pair (i, 1). The reducer collects the key/value pairs with the same key k and simply adds the values of those pairs together to obtain a sum s. It then outputs the key/value pair (k, s). This step implicitly discovers the items' vocabulary V, which is usually unknown for a huge D. The result is stored in the F-list.
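To make the counting pass concrete, the following Python sketch emulates it on a single machine; the names count_map, count_reduce, and shards are ours, not part of PFP, and a real deployment would let the MapReduce framework handle the shuffling of the emitted pairs.

    from collections import defaultdict

    def count_map(shard):
        # Mapper: emit (item, 1) for every occurrence of an item in the shard.
        for transaction in shard:
            for item in transaction:
                yield item, 1

    def count_reduce(pairs):
        # Reducer: sum the values of pairs sharing the same key (item).
        support = defaultdict(int)
        for item, value in pairs:
            support[item] += value
        return support  # corresponds to the F-list before sorting by support

    # Toy database split into two shards.
    shards = [[["a", "b", "c"], ["a", "c"]], [["b", "c"], ["a"]]]
    pairs = [p for shard in shards for p in count_map(shard)]
    print(dict(count_reduce(pairs)))  # {'a': 3, 'b': 2, 'c': 3}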
Step 3. Grouping items: dividing all the |I| items on the F-list into Q groups. The list of groups is called the group list (G-list), where each group is given a unique group-ID (gid). As both the F-list and the G-list are relatively small and the time complexity is O(|I|), this step can be completed on a single computer in a few seconds.
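The text does not prescribe a particular partitioning rule, so the sketch below (with our own naming) simply cuts the F-list into Q contiguous blocks of roughly equal size and assigns each block a gid.

    def build_g_list(f_list, q):
        # Split the F-list (items sorted by descending support) into q groups
        # of roughly equal size; each group receives a unique gid.
        size = -(-len(f_list) // q)  # ceiling division
        return {gid: f_list[gid * size:(gid + 1) * size] for gid in range(q)}

    f_list = ["c", "a", "b", "d", "e"]   # items in descending support order
    g_list = build_g_list(f_list, 2)
    print(g_list)                        # {0: ['c', 'a', 'b'], 1: ['d', 'e']}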
Step 4. Parallel FP-Growth: this is the key step of PFP. It takes one MapReduce pass, where the mappers and the reducers perform different functions as follows. The details of each function can be seen in Fig.3:
mapper: generating group-dependent transactions;
reducer: FP-Growth on group-dependent shards.
Step 5. Aggregating: aggregating the results generated in Step 4 into the final result. The algorithms of the mappers and the reducers are described in detail in Fig.4.
3.3 CanTree: Canonical-Order Tree
The CanTree (canonical-order tree) designed by Leung et al.[14,16] is a novel tree structure that captures the content of the transaction database and orders tree nodes according to some canonical order. It does not require any adjustment, merging, or splitting of tree nodes during its maintenance. When incremental updating happens, neither the rescanning of the entire updated database nor the reconstruction of a new tree is needed.
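To make the idea concrete, the following sketch (ours, not taken from [14,16]) inserts transactions into a CanTree-like structure using lexicographic order as the canonical order; because the order is fixed in advance, an incremental batch can be inserted into the existing tree without rescanning old data or restructuring existing nodes.

    class CanTreeNode:
        def __init__(self, item):
            self.item = item
            self.count = 0
            self.children = {}   # item -> CanTreeNode

    def insert(root, transaction):
        # Items are arranged in a fixed canonical order (here: lexicographic),
        # so shared prefixes merge and counts grow along the existing path.
        node = root
        for item in sorted(transaction):
            node = node.children.setdefault(item, CanTreeNode(item))
            node.count += 1

    root = CanTreeNode(None)
    for t in [["b", "a", "c"], ["a", "c"]]:   # original database
        insert(root, t)
    for t in [["c", "b"]]:                    # incremental update: just insert
        insert(root, t)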
Procedure: Mapper(key, value = T_i)
    Load G-list;
    Generate hash table H from G-list;
    a[] ← Split(T_i);
    for j = |T_i| − 1 to 0 do
        HashNum ← getHashNum(H, a[j]);
        if HashNum ≠ null then
            Delete all pairs whose hash value is HashNum in H;
            Call Output(<HashNum, a[0] + a[1] + ... + a[j]>);
        end
    end

Procedure: Reducer(key = gid, value = D_gid)
    Load G-list;
    nowGroup ← G-list_gid;
    LocalFPTree ← clear;
    foreach T_i in D_gid do
        Call insert-build-fp-tree(LocalFPTree, T_i);
    end
    foreach a_i in nowGroup do
        Define an empty max heap HP with size K;
        Call TopKFPGrowth(LocalFPTree, a_i, HP);
        foreach S_i in HP do
            Call Output(<null, S_i + supp(S_i)>);
        end
    end

Fig.3. Functions performed by mappers and reducers in parallel FP-Growth[7].
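As a rough single-machine illustration of the mapper in Fig.3, the Python sketch below regenerates group-dependent transactions from one transaction; the names map_transaction, item_to_gid, and emit are ours, and the items of each transaction are assumed to be already sorted in F-list order, as PFP requires.

    def map_transaction(transaction, item_to_gid, emit):
        # For each group appearing in the transaction, emit the longest prefix
        # of the transaction that ends at the last item of that group.
        remaining = dict(item_to_gid)          # working copy of the hash table H
        for j in range(len(transaction) - 1, -1, -1):
            gid = remaining.get(transaction[j])
            if gid is not None:
                # Drop every item mapped to this gid so the group is emitted once.
                remaining = {k: v for k, v in remaining.items() if v != gid}
                emit(gid, transaction[:j + 1])

    pairs = []
    item_to_gid = {"c": 0, "a": 0, "b": 1, "d": 1}
    map_transaction(["c", "a", "b", "d"], item_to_gid,
                    lambda g, t: pairs.append((g, t)))
    print(pairs)  # [(1, ['c', 'a', 'b', 'd']), (0, ['c', 'a'])]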
Procedure: Mapper(key, value = S + supp(S))
    foreach item a_i in S do
        Call Output(<a_i, S + supp(S)>);
    end

Procedure: Reducer(key = a_i, value = set(S + supp(S)))
    Define an empty max heap HP with size K;
    foreach itemset S in set(S + supp(S)) do
        if |HP| < K then
            Insert S + supp(S) into HP;
        else
            if supp(HP[0].S) < supp(S) then
                Delete top element in HP;
                Insert S + supp(S) into HP;
            end
        end
    end
    Call Output(<null, a_i + HP>);

Fig.4. Functions performed by mappers and reducers in aggregating[7].
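A minimal Python rendering of the reducer in Fig.4 (our own sketch) keeps only the K most supported patterns per item; we use a min-heap keyed on support, the usual way to retain a top-K set, rather than literally mirroring the max heap of the pseudocode.

    import heapq

    def aggregate_reduce(item, patterns, k):
        # Keep the k patterns with the highest support for `item`.
        # `patterns` is an iterable of (itemset, support) pairs.
        heap = []                                  # min-heap ordered by support
        for itemset, supp in patterns:
            if len(heap) < k:
                heapq.heappush(heap, (supp, itemset))
            elif heap[0][0] < supp:                # smallest retained support
                heapq.heapreplace(heap, (supp, itemset))
        return item, sorted(heap, reverse=True)

    print(aggregate_reduce("a", [(("a",), 3), (("a", "c"), 2), (("a", "b"), 1)], 2))
    # ('a', [(3, ('a',)), (2, ('a', 'c'))])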
In CanTree, items are arranged according to some canonical order, which can be determined by the user prior to the mining process or at running time during