Large-Scale Parallel FP-Growth for Optimized Query Recommendation

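"Application of the Parallel FP-Growth Algorithm to Query Recommendation"

This paper discusses an improved method for large-scale datasets, the parallel FP-Growth (PFP) algorithm, which targets the performance bottleneck of frequent itemset mining (FIM). Since FP-Growth was proposed, many optimized variants have improved its efficiency, yet memory use and computational cost remain a challenge on massive data. PFP extends the classic FP-Growth algorithm to distributed systems and gains performance by running independent mining tasks in parallel on multiple machines.

The core idea of PFP is to partition both the dataset and the computation so that each machine handles one part. This removes computational dependencies between machines and thereby reduces communication overhead. The design preserves the correctness of the mining results while significantly lowering the resources required on any single machine. The authors validate PFP in a large-scale experiment on 802,939 Web pages and 1,021,107 tags, showing how parallel computation in a distributed environment can support query recommendation. Besides faster mining, PFP may also reduce per-machine storage requirements, which matters for online applications that process large-scale data, such as e-commerce, search engines, and social networks.

In summary, the main contribution of the paper is a parallel FP-Growth algorithm (PFP) for distributed systems, which improves query recommendation by distributing the work across machines and reducing communication overhead, and which suits efficient mining tasks on large datasets.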

PFP: Parallel FP-Growth for Query Recommendation

Haoyuan Li

Google Beijing Research,

Beijing, 100084, China

Yi Wang

Google Beijing Research,

Beijing, 100084, China

Dong Zhang

Google Beijing Research,

Beijing, 100084, China

Ming Zhang

Dept. Computer Science,

Peking University, Beijing,

100071, China

Edward Y. Chang

Google Research, Mountain

View, CA 94043, USA

ABSTRACT

Frequent itemset mining (FIM) is a useful tool for discovering frequently co-occurring items. Since its inception, a number of significant FIM algorithms have been developed to speed up mining performance. Unfortunately, when the dataset is huge, both the memory use and the computational cost can still be prohibitively expensive. In this work, we propose to parallelize the FP-Growth algorithm (we call our parallel algorithm PFP) on distributed machines. PFP partitions computation in such a way that each machine executes an independent group of mining tasks. Such partitioning eliminates computational dependencies between machines, and thereby communication between them. Through an empirical study on a large dataset of 802,939 Web pages and 1,021,107 tags, we demonstrate that PFP can achieve virtually linear speedup. Besides scalability, the empirical study shows PFP to be promising for supporting query recommendation for search engines.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]; H.4 [Information Systems Applications]

General Terms

Algorithms, Experimentation, Human Factors, Performance

Keywords

Parallel FP-Growth, Data Mining, Frequent Itemset Mining

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

RecSys’08, October 23–25, 2008, Lausanne, Switzerland.

1. INTRODUCTION

In this paper, we attack two problems. First, we parallelize frequent itemset mining (FIM) so as to deal with large-scale data-mining problems. Second, we apply the parallel algorithm we developed to Web data to support query recommendation (or related search).

FIM is a useful tool for discovering frequently co-occurring items. Existing FIM algorithms such as Apriori [9] and FP-Growth [6] can be resource intensive when the mined dataset is huge. Parallel algorithms were developed to reduce memory use and computational cost on each machine. Early efforts (related work is presented in greater detail in Section 1.1) focused on speeding up the Apriori algorithm. Since the FP-Growth algorithm has been shown to run much faster than Apriori, it is logical to parallelize FP-Growth to obtain even greater speedup. Recent work on parallelizing FP-Growth [10, 8] suffers from high communication cost, which constrains the percentage of computation that can be parallelized. In this paper, we propose a MapReduce approach [4] to parallelizing the FP-Growth algorithm (we call our proposed algorithm PFP), which intelligently shards a large-scale mining task into independent computational tasks and maps them onto MapReduce jobs. PFP can achieve near-linear speedup and is capable of restarting after machine failures.
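This section does not spell out how the mining task is sharded, so the following is only a minimal sketch of one way to obtain independent tasks. It assumes that items are ranked by global frequency, assigned to groups by a hypothetical round-robin rule, and that each transaction is re-emitted once per group as the prefix ending at its last item from that group. For brevity, each shard is mined by brute-force enumeration rather than by FP-Growth, and the names `frequent_itemsets` and `sharded_frequent_itemsets` are illustrative. The point is that every shard can be mined without consulting any other shard, and the union of the shard results equals the result of mining the whole dataset.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Reference miner: brute-force count of every itemset (exponential, toy data only)."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, len(items) + 1):
            counts.update(frozenset(c) for c in combinations(items, k))
    return {s: c for s, c in counts.items() if c >= min_support}


def sharded_frequent_itemsets(transactions, min_support, num_groups):
    """Mine the same itemsets from independent shards (illustrative sketch)."""
    # Pass 1: global item counts, then rank items by descending frequency.
    item_counts = Counter(i for t in transactions for i in set(t))
    ranked = sorted(item_counts, key=lambda i: (-item_counts[i], i))
    rank = {item: r for r, item in enumerate(ranked)}
    # Assumption: round-robin assignment of ranked items to groups.
    group = {item: r % num_groups for item, r in rank.items()}

    # "Map": scan each ordered transaction from the right and emit, once per
    # group, the prefix that ends at the last item belonging to that group.
    shards = {g: [] for g in range(num_groups)}
    for t in transactions:
        ordered = sorted(set(t), key=rank.__getitem__)
        emitted = set()
        for j in range(len(ordered) - 1, -1, -1):
            g = group[ordered[j]]
            if g not in emitted:
                emitted.add(g)
                shards[g].append(ordered[: j + 1])

    # "Reduce": each shard is mined on its own. A shard only counts itemsets
    # whose lowest-ranked item belongs to its group, so no itemset is counted
    # by two shards and no shard needs data held by another.
    result = {}
    for g, shard in shards.items():
        counts = Counter()
        for prefix in shard:
            for end, last in enumerate(prefix):
                if group[last] != g:
                    continue
                head = prefix[:end]
                for k in range(len(head) + 1):
                    for combo in combinations(head, k):
                        counts[frozenset(combo + (last,))] += 1
        result.update({s: c for s, c in counts.items() if c >= min_support})
    return result


# Hypothetical toy data.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"],
                ["a", "b", "c", "d"]]
reference = frequent_itemsets(transactions, min_support=2)
sharded = sharded_frequent_itemsets(transactions, min_support=2, num_groups=3)
print(sharded == reference)
```

In a MapReduce setting, the prefix emission would correspond to the map phase with the group id as key, and the per-shard mining to the reduce phase. That mapping is an assumption of this sketch, not a description of the authors' implementation.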

The resource problem of large-scale FIM could be worked around in a classic market-basket setting by pruning out items of low support. This is because low-support itemsets are usually of little practical value; e.g., merchandise with low support (of low consumer interest) cannot help drive up revenue. However, in the Web search setting, each of the huge number of low-support queries, or long-tail queries [2], must be served with high search quality. The importance of low-support frequent itemsets in search applications requires FIM to confront its resource bottlenecks head-on.
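To make the trade-off concrete, the following small sketch shows what support-based pruning does to a query log. The data, the threshold, and the function name `prune_low_support` are hypothetical; support is taken here as the number of sessions that contain an item.

```python
from collections import Counter


def prune_low_support(transactions, min_support):
    """Drop items that occur in fewer than min_support transactions."""
    support = Counter(item for t in transactions for item in set(t))
    kept = {item for item, count in support.items() if count >= min_support}
    pruned = [[item for item in t if item in kept] for t in transactions]
    return support, kept, pruned


# Hypothetical toy query log: one head query and several long-tail queries.
sessions = [
    ["apple", "ipod"],
    ["apple", "iphone"],
    ["apple", "orange"],
    ["apple", "ipod"],
    ["quince jam recipe", "apple"],
]
support, kept, pruned = prune_low_support(sessions, min_support=2)
lost = sorted(set(support) - kept)
print(lost)  # queries that can no longer receive any recommendation
```

With a threshold of 2, three of the five distinct queries in this toy log disappear from the mining input. This is acceptable for merchandise of low consumer interest but not for long-tail queries that must still be served.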

In particular, this paper shows that a post-search recommendation tool called related search can benefit a great deal from our scalable FIM solution. Related search provides related queries to the user after an initial search has been completed. For instance, a query of 'apple' may suggest 'orange', 'iPod' and 'iPhone' as alternate queries. Related search can also suggest related sites of a given site (see example in Section 3.2).
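As a rough illustration of how frequent co-occurrence can drive related search, the sketch below counts query pairs that appear together and suggests the most frequent partners of a given query. The session data, the pair-only restriction, and the names `build_related` and `suggest` are assumptions made for this example. It is not the recommendation procedure evaluated in the paper.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_related(transactions, min_support):
    """Index frequent 2-itemsets by member so partners can be looked up."""
    pair_counts = Counter()
    for t in transactions:
        for a, b in combinations(sorted(set(t)), 2):
            pair_counts[(a, b)] += 1
    related = defaultdict(list)
    for (a, b), count in pair_counts.items():
        if count >= min_support:
            related[a].append((b, count))
            related[b].append((a, count))
    # Most frequent partner first; ties broken alphabetically.
    for partners in related.values():
        partners.sort(key=lambda p: (-p[1], p[0]))
    return dict(related)


def suggest(related, query, k=3):
    """Return up to k related queries, or an empty list for unknown queries."""
    return [partner for partner, _ in related.get(query, [])[:k]]


# Hypothetical toy sessions of co-occurring queries.
sessions = [
    ["apple", "ipod"],
    ["apple", "ipod", "iphone"],
    ["apple", "iphone"],
    ["apple", "orange"],
    ["apple", "orange"],
    ["apple", "ipod"],
    ["orange", "juice"],
]
related = build_related(sessions, min_support=2)
print(suggest(related, "apple"))  # ['ipod', 'iphone', 'orange']
```

A query such as 'juice', whose only pair falls below the threshold, gets no suggestions here. This is the long-tail problem that motivates mining at low support.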

1.1 Related Work

Some previous efforts [10] [7] parallelized the FP-Growth algorithm across multiple threads with shared memory. However, for our problem of processing huge databases, these approaches do not address the bottleneck of the huge memory requirement.
