HPFP-Miner: A Novel Parallel Frequent Itemset Mining Algorithm
CHEN Xiaoyun
1
, HE Yanshan
1
, CHEN Pengfei
1
, MIAO Shengfa
1
, SONG Weiguo
1
, YUE Min
1
1.School of Information Science and Engineering
Lanzhou University
Lanzhou Gansu, PRC
e-mail: heysh07@lzu.cn
Abstract—Frequent itemset mining is a fundamental and
essential issue in data mining field and can be used in many
data mining tasks. Most of these mining tasks require multiple
passes over the database and if the database size is large,
which is usually the case, scalable high performance solutions
involving multiple processors are required. In this paper, we
present a novel parallel frequent itemset mining algorithm
which is called HPFP-Miner. The proposed algorithm is based
on FP-Growth and introduces little communication overheads
by efficiently partitioning the list of frequent elements list over
processors. The results of experiment show that HPFP-Miner
has good scalability and performance.
Keywords-frequent itemset; data mining; parallel; HPFP-
Miner; FP-Growth
I. INTRODUCTION
Frequent itemset mining is a fundamental and essential
issue in data mining field and can be used in many data
mining tasks which include association rules analysis,
correlations analysis, classification, clustering, outlier
analysis, etc. Since 1993 Agrawal et al.[1] proposed
association rules analysis problem through research of the
relationship among itemsets in custom transaction database
and proved the core of association rules analysis problem
was mining the frequent itemsets which could generate
strong association rules, the issue of mining frequent
itemsets has been an active research topic.
However, the huge search space makes frequent pattern
mining on large databases a very computationally intensive
problem. The performance of a serial frequent pattern
mining algorithm is limited by its single processor
computing capability and finite memory space; thus, they do
not scale to large datasets. Therefore, it is very necessary to
exploit high performance parallel/distributed computing to
relieve the current frequent pattern mining algorithms from
the sequential bottleneck, and helps the FPM algorithms
scale to massive datasets[2].
FP-Growth[3] algorithm proposed by J.Han in 2000 is
one of the most popular frequent itemset mining algorithm.
It adopts divide-and-conquer method which compresses
transaction database to a frequent pattern tree—FP-tree first,
then mines complete frequent itemsets from FP-tree. The
intrinsic divide-and-conquer nature of FP-growth can easily
be parallelized. In 2001, Osmar R. Zaïane et al. proposed
MLFPT[4] which is a simple parallel algorithm based on
FP-growth for mining frequent itemsets on shared memory
machines. The key idea behind this task parallel paradigm is
the divide-and-conquer nature of the FP-growth algorithm.
Later, Javed and Khokhar et al. proposed another FP-
growth-based parallel algorithm called PFP-tree[5] but
running on a distributed memory architecture. The FP-tree
building process is similar to that in MLFPT except that
multiple FP-trees reside in different physical memory spaces.
The mining task is also divided in the same way as in
MLFPT. Another difference is that the conditional pattern
base of each 1-itemset must be exchanged among processors
so that each can hold the needed portion of the database
locally.
In this paper, we present a novel parallel frequent itemset
mining algorithm, called HPFP-Miner, to mine the complete
set of frequent itemsets on parallel distributed memory
architecture. We implement HPFP-Miner on four real world
datasets, and compare the results with two famous FP-
Growth-like algorithms. The experimental results show that
our algorithm has a better scalability and performance.
II.
R
ELATED WORK
For the sake of completeness, we give the relative
problem description of frequent itemset mining and FP-
Growth in this section.
Definition 1: Let I={i
1
,i
2
,ĂĂ,i
n
} be a set of items,
where an itemset X is a subset of I such that
I
. An
itemset of length k is called k-itemset. Let D be a set of
transactions, where each transaction T is a set of items such
that . A unique identified TID is given to each
transaction. The support of an itemset X is the number of
transactions in D that contain X. If the support of X is
greater than or equal to a predefined support threshold
TD
, it
is called a frequent itemset or frequent pattern, otherwise, it
is infrequent.
The frequent pattern mining problem is to discover the
complete set of all patterns contained in at least a specified
support threshold
, of transactions in the transaction
database.
FP-Growth-like algorithms adopt divide-and-conquer
method which can be stated as follow: First, it compresses
the database representing frequent items into a frequent-
pattern tree, or FP-tree, which retains the itemset association
information. It then divides the compressed database into a
set of conditional databases (a special kind of projected
database), each associated with one frequent item or “pattern
fragment,” and mines each such database separately[6].
FP-Growth opened up a new way to efficiently mine
frequent pattern. However, its low time and space utilization
2009 Fifth International Conference on Natural Computation
978-0-7695-3736-8/09 $25.00 © 2009 IEEE
DOI 10.1109/ICNC.2009.263
139
2009 Fifth International Conference on Natural Computation
978-0-7695-3736-8/09 $25.00 © 2009 IEEE
DOI 10.1109/ICNC.2009.263
139
2009 Fifth International Conference on Natural Computation
978-0-7695-3736-8/09 $25.00 © 2009 IEEE
DOI 10.1109/ICNC.2009.263
139
2009 Fifth International Conference on Natural Computation
978-0-7695-3736-8/09 $25.00 © 2009 IEEE
DOI 10.1109/ICNC.2009.263
139
2009 Fifth International Conference on Natural Computation
978-0-7695-3736-8/09 $25.00 © 2009 IEEE
DOI 10.1109/ICNC.2009.263
139
2009 Fifth International Conference on Natural Computation
978-0-7695-3736-8/09 $25.00 © 2009 IEEE
DOI 10.1109/ICNC.2009.263
139
2009 Fifth International Conference on Natural Computation
978-0-7695-3736-8/09 $25.00 © 2009 IEEE
DOI 10.1109/ICNC.2009.263
139
2009 Fifth International Conference on Natural Computation
978-0-7695-3736-8/09 $25.00 © 2009 IEEE
DOI 10.1109/ICNC.2009.263
139
2009 Fifth International Conference on Natural Computation
978-0-7695-3736-8/09 $25.00 © 2009 IEEE
DOI 10.1109/ICNC.2009.263
139
2009 Fifth International Conference on Natural Computation
978-0-7695-3736-8/09 $25.00 © 2009 IEEE
DOI 10.1109/ICNC.2009.263
139
2009 Fifth International Conference on Natural Computation
978-0-7695-3736-8/09 $25.00 © 2009 IEEE
DOI 10.1109/ICNC.2009.263
139