FP-Growth on Spark in a Distributed Environment

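"A Spark implementation of FP-Growth for frequent itemset mining on large datasets, using parallelism to speed up mining."

FP-Growth is a widely used method for discovering frequent itemsets in data mining. It builds an FP-tree (Frequent Pattern Tree) to find the itemsets that occur frequently in a database, which makes it well suited to large datasets. As the dataset grows, however, the memory consumption and computational cost of traditional single-machine FP-Growth become very high.

Spark is a distributed computing framework designed for large-scale data. Combining FP-Growth with Spark spreads the mining task across multiple nodes, so the work runs in parallel, the load on any single node drops, and overall performance improves. PFP (Parallel FP-Growth), described in the paper excerpted below, is a solution proposed for this problem; the paper itself formulates PFP as MapReduce jobs. PFP divides the mining task into independent subtasks and has each node execute a portion of them, which removes computational dependencies between nodes and reduces communication overhead.

PFP first preprocesses the data to produce the portion of the FP-tree that each subtask needs, then runs pattern growth in parallel across the cluster. Each node independently finds the frequent itemsets it is responsible for, and a final aggregation step merges the results from all nodes into the complete set of frequent itemsets.

The paper evaluates PFP empirically on a large dataset of 802,939 Web pages and 1,021,107 tags. It reports virtually linear speedup and finds PFP promising for query recommendation in search engines, particularly on datasets too large for traditional methods.

In summary, a Spark implementation of FP-Growth such as PFP is an effective tool for frequent itemset mining on large datasets. It uses parallel computation to reduce the memory use and computational cost of FP-Growth on each machine, and it is valuable in recommender systems, market-basket analysis, and user behavior analysis.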

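The FP-tree mentioned in the summary above can be illustrated with a small single-machine sketch. This is a minimal illustration rather than code from the paper or from Spark: the names `FPNode` and `build_fp_tree` and the toy transactions are assumptions. It only builds the tree (a prefix tree over support-ordered transactions) and leaves out the pattern-growth mining step.

```python
from collections import Counter


class FPNode:
    """One node of an FP-tree: an item, its count, and its child nodes."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_support):
    """Build an FP-tree from a list of transactions (each a list of items)."""
    # Pass 1: count how many transactions contain each item.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in support.items() if c >= min_support}

    root = FPNode()
    # Pass 2: insert each transaction, keeping only frequent items,
    # ordered by descending support (ties broken by item name).
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root, frequent


# Toy data, purely illustrative.
transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "d"],
    ["b", "c"],
]
tree, frequent = build_fp_tree(transactions, min_support=2)
```

Because transactions that share a prefix of frequent items share a path in the tree, the structure is usually much smaller than the raw data, which is what makes the later mining step cheap on one machine.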
PFP: Parallel FP-Growth for Query Recommendation

Haoyuan Li
Google Beijing Research, Beijing, 100084, China

Yi Wang
Google Beijing Research, Beijing, 100084, China

Dong Zhang
Google Beijing Research, Beijing, 100084, China

Ming Zhang
Dept. Computer Science, Peking University, Beijing, 100071, China

Edward Chang
Google Research, Mountain View, CA 94043, USA

ABSTRACT

Frequent itemset mining (FIM) is a useful tool for discovering frequently co-occurrent items. Since its inception, a number of significant FIM algorithms have been developed to speed up mining performance. Unfortunately, when the dataset size is huge, both the memory use and computational cost can still be prohibitively expensive. In this work, we propose to parallelize the FP-Growth algorithm (we call our parallel algorithm PFP) on distributed machines. PFP partitions computation in such a way that each machine executes an independent group of mining tasks. Such partitioning eliminates computational dependencies between machines, and thereby communication between them. Through empirical study on a large dataset of 802,939 Web pages and 1,021,107 tags, we demonstrate that PFP can achieve virtually linear speedup. Besides scalability, the empirical study demonstrates PFP to be promising for supporting query recommendation for search engines.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]; H.4 [Information Systems Applications]

General Terms

Algorithms, Experimentation, Human Factors, Performance

Keywords

Parallel FP-Growth, Data Mining, Frequent Itemset Mining

1. INTRODUCTION

In this paper, we attack two problems. First, we parallelize frequent itemset mining (FIM) so as to deal with large-scale data-mining problems. Second, we apply our developed parallel algorithm on Web data to support query recommendation (or related search).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

ACM RS

FIM is a useful tool for discovering frequently co-occurrent items. Existing FIM algorithms such as Apriori [9] and FP-Growth [6] can be resource intensive when a mined dataset is huge. Parallel algorithms were developed for reducing memory use and computational cost on each machine. Early efforts (related work is presented in greater detail in Section 1.1) focused on speeding up the Apriori algorithm. Since the FP-Growth algorithm has been shown to run much faster than Apriori, it is logical to parallelize the FP-Growth algorithm to enjoy even faster speedup. Recent work in parallelizing FP-Growth [10, 8] suffers from high communication cost, which constrains the percentage of computation that can be parallelized. In this paper, we propose a MapReduce approach [4] to parallel FP-Growth (we call our proposed algorithm PFP), which intelligently shards a large-scale mining task into independent computational tasks and maps them onto MapReduce jobs. PFP can achieve near-linear speedup with the capability of restarting after computer failures.
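One way to picture how a mining task can be sharded into independent pieces is the sketch below. It is not the paper's implementation: the function name `pfp_sketch`, the round-robin assignment of items to groups, and the brute-force subset counting used in place of a local FP-Growth run are all assumptions made for illustration, and the map and reduce phases are simulated in a single process. The idea it shows is that each group receives, for every transaction, the prefix (in descending-support order) ending at that group's last item, and then reports only itemsets whose least frequent item belongs to the group.

```python
from collections import Counter, defaultdict
from itertools import combinations


def pfp_sketch(transactions, min_support, num_groups):
    """Mine frequent itemsets by splitting work into independent item groups."""
    # Counting pass: support of every item (a parallel count in a real job).
    support = Counter(i for t in transactions for i in set(t))
    # Frequent items in descending support order; rank = position in the list.
    f_list = sorted((i for i, c in support.items() if c >= min_support),
                    key=lambda i: (-support[i], i))
    rank = {item: r for r, item in enumerate(f_list)}
    # Assumption: items are assigned to groups round-robin by rank.
    group_of = {item: r % num_groups for item, r in rank.items()}

    # "Map": emit at most one group-dependent prefix per transaction and group.
    shards = defaultdict(list)
    for t in transactions:
        items = sorted((i for i in set(t) if i in rank), key=rank.get)
        seen = set()
        for j in range(len(items) - 1, -1, -1):
            g = group_of[items[j]]
            if g not in seen:
                seen.add(g)
                shards[g].append(items[: j + 1])

    # "Reduce": each group mines its own shard, with no cross-group traffic.
    result = {}
    for g, shard in shards.items():
        counts = Counter()
        for prefix in shard:
            for k in range(1, len(prefix) + 1):
                for combo in combinations(prefix, k):
                    # A group owns itemsets whose least frequent item is its own.
                    if group_of[combo[-1]] == g:
                        counts[combo] += 1
        for combo, n in counts.items():
            if n >= min_support:
                result[frozenset(combo)] = n
    return result


# Toy data, purely illustrative.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["b", "c"]]
patterns = pfp_sketch(transactions, min_support=2, num_groups=2)
```

Every itemset is owned by exactly one group, so the per-group results never overlap and the final merge is a plain union. The brute-force counting is exponential in transaction length and is only there to keep the sketch short.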

The resource problem of large-scale FIM could be worked around in a classic market-basket setting by pruning out items of low support. This is because low-support itemsets are usually of little practical value, e.g., merchandise with low support (of low consumer interest) cannot help drive up revenue. However, in the Web search setting, each of the huge number of low-support queries, or long-tail queries [2], must be maintained with high search quality. The importance of low-support frequent itemsets in search applications requires FIM to confront its resource bottlenecks head-on.

In particular, this paper shows that a post-search recommendation tool called related search can benefit a great deal from our scalable FIM solution. Related search provides related queries to the user after an initial search has been completed. For instance, a query of 'apple' may suggest 'orange', 'iPod' and 'iPhone' as alternate queries. Related search can also suggest related sites of a given site (see example in Section 3.2).
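A minimal sketch of how co-occurrence counts could drive such suggestions follows. It is an illustration only, not the recommendation method of the paper: the function name `related_items`, the scoring by raw pair support, and the toy sessions are all assumptions.

```python
from collections import Counter
from itertools import combinations


def related_items(transactions, query, min_support=2, top_k=3):
    """Rank items that co-occur with `query` in at least `min_support` transactions."""
    # Count every unordered pair of distinct items once per transaction.
    pair_counts = Counter()
    for t in transactions:
        for a, b in combinations(sorted(set(t)), 2):
            pair_counts[(a, b)] += 1

    # Keep frequent pairs that involve the query and score the partner item.
    scores = {}
    for (a, b), n in pair_counts.items():
        if n < min_support:
            continue
        if a == query:
            scores[b] = n
        elif b == query:
            scores[a] = n

    # Highest support first; ties broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_k]]


# Hypothetical sessions, each listing queries that appeared together.
sessions = [
    ["apple", "iPod"],
    ["apple", "iPhone", "iPod"],
    ["apple", "orange"],
    ["apple", "orange", "banana"],
    ["apple", "iPod"],
    ["orange", "banana"],
]
suggestions = related_items(sessions, "apple")
```

With `min_support=2`, pairs seen only once are dropped. Lowering the threshold keeps more long-tail suggestions at the cost of counting many more pairs, which is the resource pressure the paragraphs above describe.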

1.1 Related Work

Some previous efforts [10] [7] parallelized the FP-Growth algorithm across multiple threads but with shared memory. However, for our problem of processing huge databases, these approaches do not address the bottleneck of huge memory requirement.
