PFP: Parallel FP-Growth for Query Recommendation
Haoyuan Li
Google Beijing Research,
Beijing, 100084, China
Yi Wang
Google Beijing Research,
Beijing, 100084, China
Dong Zhang
Google Beijing Research,
Beijing, 100084, China
Ming Zhang
Dept. Computer Science,
Peking University, Beijing,
100071, China
Edward Chang
Google Research, Mountain
View, CA 94043, USA
ABSTRACT
Frequent itemset mining (FIM) is a useful tool for discov-
ering frequently co-occurrent items. Since its inception, a
number of significant FIM algorithms have been developed
to speed up mining performance. Unfortunately, when the
dataset size is huge, both the memory use and computa-
tional cost can still be prohibitively expensive. In this work,
we propose to parallelize the FP-Growth algorithm (we call
our parallel algorithm PFP) on distributed machines. PFP
partitions computation in such a way that each machine
executes an independent group of mining tasks. Such parti-
tioning eliminates computational dependencies between ma-
chines, and thereby communication between them. Through
empirical study on a large dataset of 802, 939 Web pages and
1, 021, 107 tags, we demonstrate that PFP can achieve virtu-
ally linear speedup. Besides scalability, the empirical study
demonstrates that PFP to be promising for supporting query
recommendation for search engines.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]; H.4 [Information
Systems Applications]
General Terms
Algorithms, Experimentation, Human Factors, Performance
Keywords
Parallel FP-Growth, Data Mining, Frequent Itemset Mining
1. INTRODUCTION
In this paper, we attack two problems. First, we par-
allelize frequent itemset mining (FIM) so as to deal with
large-scale data-mining problems. Second, we apply our de-
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ACM RS
Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$5.00.
veloped parallel algorithm on Web data to support query
recommendation (or related search).
FIM is a useful tool for discovering frequently co-occurrent
items. Existing FIM algorithms such as Apriori [9] and FP-
Growth [6] can be resource intensive when a mined dataset is
huge. Parallel algorithms were developed for reducing mem-
ory use and computational cost on each machine. Early
efforts (related work is presented in greater detail in Sec-
tion 1.1) focused on speeding up the Apriori algorithm. Since
the FP-Growth algorithm has been shown to run much faster
than the Apriori, it is logical to parallelize the FP-Growth
algorithm to enjoy even faster speedup. Recent work in
parallelizing FP-Growth [10, 8] suffers from high communi-
cation cost, and hence constrains the percentage of compu-
tation that can be parallelized. In this pap er, we propose
a MapReduce approach [4] of parallel FP-Growth algorithm
(we call our proposed algorithm PFP), which intelligently
shards a large-scale mining task into independent compu-
tational tasks and maps them onto MapReduce jobs. PFP
can achieve near-linear speedup with capability of restarting
from computer failures.
The resource problem of large-scale FIM could be worked
around in a classic market-basket setting by pruning out
items of low support. This is because low-support itemsets
are usually of little practical value, e.g., a merchandise with
low support (of low consumer interest) cannot help drive
up revenue. However, in the Web search setting, the huge
number of low-support queries, or long-tail queries [2], each
must be maintained with high search quality. The impor-
tance of low-support frequent itemsets in search applications
requires FIM to confront its resource bottlenecks head-on.
In particular, this paper shows that a post-search recom-
mendation tool called related search can benefit a great deal
from our scalable FIM solution. Related search provides
related queries to the user after an initial search has been
completed. For instance, a query of ’apple’ may suggest
’orange’, ’iPod’ and ’iPhone’ as alternate queries. Related
search can also suggest related sites of a given site (see ex-
ample in Section 3.2).
1.1 Related Work
Some previous efforts [10] [7] parallelized the FP-Growth
algorithm across multiple threads but with shared memory.
However, to our problem of processing huge databases, these
approaches do not address the bottleneck of huge memory
requirement.