Parallel Implementation of Apriori Algorithm Based on MapReduce
Ning Li
1,2,3
, Li Zeng
1
, Qing He
1
and Zhongzhi Shi
1
1
The Key Laboratory of Intelligent Information Processing,
Institute of Computing Technology, Chinese Academy of Sciences,Beijing,100190,China
2
Graduate University of Chinese Academy of Sciences,Beijing,100139,China
3
Key Lab. of Machine Learning and Computational Intelligence, College of Mathematics and Computer Science, Hebei University,
Baoding, 071002, Hebei, China
lin@ics.ict.ac.cn, heq@ics.ict.ac.cn
Abstract—Searching frequent patterns in transactional
databases is considered as one of the most important data
mining problems and Apriori is one of the typical algorithms
for this task. Developing fast and efficient algorithms that can
handle large volumes of data becomes a challenging task due
to the large databases. In this paper, we implement a parallel
Apriori algorithm based on MapReduce, which is a
framework for processing huge datasets on certain kinds of
distributable problems using a large number of computers
(nodes). The experimental results demonstrate that the
proposed algorithm can scale well and efficiently process large
datasets on commodity hardware.
Keywords-Apriori algorithm; Frequent itemsets;
MapReduce; Parallel implementation; Large database
I. INTRODUCTION
Data Mining has attracted a great deal of attention in the
information industry and in society as a whole in recent
years. One of the important problems in data mining is
discovering association rules from databases of transactions
where each transaction consists of a set of items. Many
algorithms have been proposed to find frequent item sets
from a large database. However, there has not yet been
published implementation performing the best under
whatever conditions [14]. Apriori is one of the typical
algorithms, which is a seminal algorithm proposed by
R.Agrawal and R.Srikant in 1994 for mining frequent
itemsets for Boolean association rules [5]. It aggressively
prunes the set of potential candidates of size k by using the
following observation: a candidate of size k can be frequent
only if all of its subsets also meet the minimum threshold of
support. Even with the pruning, the task of finding all
association rules requires a lot of computation power and
memory. Parallel computers offer a potential solution to the
computation requirement of this task, if the efficient and
scalable parallel algorithms can be designed.
MapReduce is a
patented software framework introduced
by Google in 2004. It is a programming model and an
associated implementation for processing and generating
large data sets in a massively parallel manner [2,6]. Some
data preprocessing, clustering and classification algorithms
have been implemented based on MapReduce [1,8,13].
In this paper, we implemented the parallel Apriori
algorithm based on MapReduce, which makes it applicable
to mine association rules from large databases of
transactions.
The rest of the paper is organized as follows. In
Section 2, we introduce the basic Apriori algorithm. Section
3 gives an overview of MapReduce. In Section 4, we
present the details of the parallel implementation of Apriori
algorithms based on MapReduce. Experimental results and
evaluations are showed in Section 5 with respect to speedup,
scaleup, and sizeup. Finally, Section 6 concludes the paper.
II. A
PRIORI ALGORITHM
A. Problem Statement
The problem of mining association rules over market
basket analysis was introduced in [10]. It consists of finding
associations between items or itemsets in transactional data
[7].
As defined in [11], the problem can be formally stated
as follows. Let
12
{, , , }
m
ii i= ! be a set of literals,
called items. Let
be a set of transactions, where each
transaction
T
is a set of items such that TI⊆ . Each
transaction has a unique identifier TID. A transaction
T
is
said to contain
X
, a set of items in
, if XT⊆ . An
association rule is an implication of the form “
XY ”,
where
XI⊆ , YI⊆ and
XY=∅
.
Each itemset has an associated measure of statistical
significance called support. For An itemset
, we say its
support is
if the fraction of transactions in
containing
equals
.
The rule
XY has a support
in the transaction set
if s of the transactions in
contain
Y*
. The problem of discovering all
association rules from a set of transactions
consists of
generating the rules that have a support and confidence
greater than given thresholds. These rules are called strong
rules.
This association-mining task can be broken into two
steps:
Step1. The large or frequent itemsets which have support
above the user specified minimum support are generated.
Step2. Generate confident rules from the frequent itemsets.
2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed
Computing
978-0-7695-4761-9/12 $26.00 © 2012 IEEE
DOI 10.1109/SNPD.2012.31
236