Parallel MapReduce Implementation of the Apriori Algorithm: Efficient Mining of Large Datasets


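This paper examines how to parallelize the Apriori algorithm on the MapReduce framework in order to mine frequent patterns from large transactional databases. By exploiting distributed computation, it improves data-processing efficiency and meets the demands of big-data environments. The experimental results show that the method scales well and runs efficiently on large datasets.

Apriori is a classic association rule learning algorithm, used mainly to discover frequent itemsets and strong association rules in transactional databases. Its basic idea is that if an itemset is frequent, then all of its subsets must also be frequent. As databases grow, however, the traditional Apriori algorithm runs into limits on both time and space efficiency.

MapReduce is a programming model proposed by Google for processing and generating large datasets. It decomposes a computation into two main phases. The Map phase turns the input data into key-value pairs that are processed in parallel; the Reduce phase aggregates the Map output and produces the final result.

The parallel Apriori algorithm implemented in the paper relies on this parallelism. In the Map phase, each node processes one portion of the transaction data and produces local frequent itemsets; in the Reduce phase, these local results are merged to identify the global frequent itemsets. This parallelization significantly reduces the number of data scans and candidate-generation passes, which improves overall performance.

The paper may also discuss optimization strategies, such as reducing the communication cost of intermediate results, managing memory, and storing and transferring data effectively in a distributed environment. With such optimizations the algorithm can process large datasets on commodity hardware without expensive high-performance computing resources.

The experimental section may compare the parallel Apriori algorithm with a single-machine version or with other parallel algorithms, for example in terms of running time, memory usage, and the effect of the degree of parallelism on performance. Finally, the paper may discuss the limitations of the algorithm and directions for future work, such as adapting the parallel algorithm to more complex data distributions and computational requirements. Overall, the paper offers an effective way to speed up Apriori in big-data settings and is of theoretical and practical value for understanding and applying parallel data mining techniques.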

Parallel Implementation of Apriori Algorithm Based on MapReduce

Ning Li (1,2,3), Li Zeng, Qing He and Zhongzhi Shi

The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China

Graduate University of Chinese Academy of Sciences, Beijing, 100139, China

Key Lab. of Machine Learning and Computational Intelligence, College of Mathematics and Computer Science, Hebei University, Baoding, 071002, Hebei, China

lin@ics.ict.ac.cn, heq@ics.ict.ac.cn

Abstract: Searching for frequent patterns in transactional databases is considered one of the most important data mining problems, and Apriori is one of the typical algorithms for this task. Developing fast and efficient algorithms that can handle large volumes of data is challenging because of the size of the databases. In this paper, we implement a parallel Apriori algorithm based on MapReduce, which is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes). The experimental results demonstrate that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.

Keywords: Apriori algorithm; Frequent itemsets; MapReduce; Parallel implementation; Large database

I. INTRODUCTION

Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years. One of the important problems in data mining is discovering association rules from databases of transactions, where each transaction consists of a set of items. Many algorithms have been proposed to find frequent itemsets in a large database. However, no published implementation has yet been shown to perform best under all conditions [14]. Apriori is one of the typical algorithms, a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules [5]. It aggressively prunes the set of potential candidates of size k by using the following observation: a candidate of size k can be frequent only if all of its subsets also meet the minimum support threshold. Even with the pruning, the task of finding all association rules requires a lot of computation power and memory. Parallel computers offer a potential solution to the computation requirement of this task, provided that efficient and scalable parallel algorithms can be designed.
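The pruning observation can be illustrated with a minimal single-machine sketch. It is not the paper's implementation: the function name `generate_candidates` and the toy itemsets are hypothetical, and for brevity it enumerates all size-k combinations of the surviving items instead of using the usual join step.

```python
from itertools import combinations

def generate_candidates(frequent_prev, k):
    """Return size-k candidates whose (k-1)-subsets are all frequent.

    frequent_prev: set of frozensets, the frequent (k-1)-itemsets.
    Simplification (assumption): brute-force enumeration over the items
    that still occur in frequent_prev, rather than the classic join step.
    """
    items = sorted({item for itemset in frequent_prev for item in itemset})
    candidates = set()
    for combo in combinations(items, k):
        # Pruning: discard the candidate if any (k-1)-subset is not frequent.
        if all(frozenset(sub) in frequent_prev
               for sub in combinations(combo, k - 1)):
            candidates.add(frozenset(combo))
    return candidates

# Toy frequent 2-itemsets (made up for illustration).
frequent_2 = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]}
candidates_3 = generate_candidates(frequent_2, 3)
print(sorted(sorted(c) for c in candidates_3))  # [['a', 'b', 'c']]
```

Here {a, b, d} is pruned because {a, d} is not frequent, so its support never has to be counted against the database.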

MapReduce is a patented software framework introduced by Google in 2004. It is a programming model and an associated implementation for processing and generating large data sets in a massively parallel manner [2,6]. Some data preprocessing, clustering and classification algorithms have been implemented based on MapReduce [1,8,13].
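To make the programming model concrete, the following sketch simulates the map, shuffle and reduce steps in a single Python process. It does not use Hadoop or any real MapReduce API; `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical names, and the task shown (counting how many transactions contain each item) is only an assumed example.

```python
from collections import defaultdict

def map_fn(transaction):
    """Map: emit (item, 1) once for each distinct item in a transaction."""
    for item in set(transaction):
        yield item, 1

def reduce_fn(key, values):
    """Reduce: sum the partial counts collected for one key."""
    return key, sum(values)

def run_mapreduce(records, mapper, reducer):
    """Sequential stand-in for a MapReduce job (assumption: one process)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    return dict(reducer(key, values) for key, values in groups.items())

# Toy transactions (made up for illustration).
transactions = [["bread", "milk"], ["bread", "beer"], ["milk", "beer", "bread"]]
item_counts = run_mapreduce(transactions, map_fn, reduce_fn)
print(item_counts)
```

In a real framework the calls to the mapper would run on different nodes and the grouping would be done by the runtime; the user only supplies the two functions.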

In this paper, we implement the parallel Apriori algorithm based on MapReduce, which makes it applicable to mining association rules from large databases of transactions.

The rest of the paper is organized as follows. In Section 2, we introduce the basic Apriori algorithm. Section 3 gives an overview of MapReduce. In Section 4, we present the details of the parallel implementation of the Apriori algorithm based on MapReduce. Experimental results and evaluations are shown in Section 5 with respect to speedup, scaleup, and sizeup. Finally, Section 6 concludes the paper.

II. APRIORI ALGORITHM

A. Problem Statement

The problem of mining association rules in market basket analysis was introduced in [10]. It consists of finding associations between items or itemsets in transactional data [7].

As defined in [11], the problem can be formally stated as follows. Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of literals, called items. Let $D$ be a set of transactions, where each transaction $T$ is a set of items such that $T \subseteq I$. Each transaction has a unique identifier TID. A transaction $T$ is said to contain $X$, a set of items in $I$, if $X \subseteq T$. An association rule is an implication of the form "$X \Rightarrow Y$", where $X \subseteq I$, $Y \subseteq I$ and $X \cap Y = \emptyset$.

Each itemset has an associated measure of statistical significance called support. For an itemset $X$, we say its support is $s$ if the fraction of transactions in $D$ containing $X$ equals $s$. The rule $X \Rightarrow Y$ has a support $s$ in the transaction set $D$ if a fraction $s$ of the transactions in $D$ contain $X \cup Y$. The problem of discovering all association rules from a set of transactions $D$ consists of generating the rules that have a support and confidence greater than given thresholds. These rules are called strong rules.
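The definitions above translate directly into code. The sketch below computes support as defined in the text; the excerpt does not define confidence, so it assumes the standard definition, confidence(X ⇒ Y) = support(X ∪ Y) / support(X). The function names and the toy transaction set are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions in D that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Standard confidence of X => Y (assumed definition, not from the excerpt).

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Toy transaction set D (made up for illustration).
D = [{"bread", "milk"}, {"bread", "beer"}, {"milk", "beer", "bread"}, {"milk"}]

s = support({"bread", "milk"}, D)       # 2 of 4 transactions -> 0.5
c = confidence({"bread"}, {"milk"}, D)  # 0.5 / 0.75, about 0.667
print(s, round(c, 3))
```

With a minimum support of 0.5 and a minimum confidence of 0.6, the rule bread ⇒ milk would count as strong on this toy data.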

This association-mining task can be broken into two steps:

Step 1. Generate the large, or frequent, itemsets, i.e. those whose support is above the user-specified minimum support.

Step 2. Generate confident rules from the frequent itemsets.
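The two steps can be sketched end to end on a single machine as follows. This is an illustrative sequential version under stated assumptions, not the parallel algorithm of the paper: `frequent_itemsets` and `strong_rules` are hypothetical names, thresholds are treated as inclusive, and confidence is taken to be support(X ∪ Y) / support(X).

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Step 1: level-wise search; returns {itemset: support}."""
    n = len(transactions)
    level = {frozenset([i]) for t in transactions for i in t}
    result, k = {}, 1
    while level:
        # Count each candidate with one pass over the transactions.
        counted = {c: sum(1 for t in transactions if c <= t) / n for c in level}
        current = {c: s for c, s in counted.items() if s >= min_support}
        result.update(current)
        k += 1
        # Join frequent itemsets into size-k candidates, then prune those
        # that have an infrequent (k-1)-subset.
        joined = {a | b for a, b in combinations(list(current), 2)
                  if len(a | b) == k}
        level = {c for c in joined
                 if all(frozenset(s) in current for s in combinations(c, k - 1))}
    return result

def strong_rules(freq, min_confidence):
    """Step 2: split each frequent itemset into antecedent => consequent."""
    rules = []
    for itemset, supp in freq.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # Every subset of a frequent itemset is frequent, so the
                # antecedent's support is already stored in freq.
                conf = supp / freq[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, supp, conf))
    return rules

# Toy transaction set (made up for illustration).
D = [{"bread", "milk"}, {"bread", "beer"}, {"milk", "beer", "bread"}, {"milk"}]
freq = frequent_itemsets(D, min_support=0.5)
rules = strong_rules(freq, min_confidence=0.8)
for antecedent, consequent, supp, conf in rules:
    print(sorted(antecedent), "=>", sorted(consequent), supp, conf)
```

On this data Step 1 finds five frequent itemsets, and Step 2 keeps only beer ⇒ bread, which has support 0.5 and confidence 1.0. Step 1 dominates the cost because each level requires a pass over the transactions.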

2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing

DOI 10.1109/SNPD.2012.31

