HPFP-Miner: 并行频繁项集挖掘算法的高效解决方案

需积分: 9 41 浏览量更新于2024-09-07 收藏 873KB PDF 举报

在数据挖掘领域，频繁项集挖掘是一项基础且至关重要的任务，它广泛应用于市场篮子分析、关联规则学习等多个场景中。随着数据规模的迅速增长，单个处理器处理大数据集的效率逐渐成为瓶颈，因此，研究高效的并行频繁项集挖掘算法变得尤为重要。本文提出了一种新颖的并行频繁项集挖掘算法，名为HPFP-Miner。 HPFP-Miner的设计灵感源于经典的FP-Growth算法，该算法在挖掘频繁项集时，通过构建频繁项集的树状结构来减少数据处理的复杂性。然而，传统FP-Growth算法在处理大规模数据时，通信开销较大，因为它涉及到多个处理器之间的频繁数据交换。为了解决这一问题，HPFP-Miner通过优化元素列表的划分策略，有效地减少了处理器间的通信成本。算法的关键在于将频繁元素列表划分到不同的处理器上，确保每个处理器负责的部分相对均衡，从而实现了负载均衡和性能提升。实验结果显示，HPFP-Miner在保持高精度的同时，展示了良好的可扩展性和性能。随着数据库规模的增大，其运行时间的增长速度明显低于传统的非并行方法，这表明其在大规模数据环境下的优势明显。此外，算法设计巧妙地平衡了计算负载和通信开销，使得它在实际应用中具有很高的实用性。总结来说，HPFP-Miner作为一项并行频繁项集挖掘的创新解决方案，通过改进FP-Growth算法的并行化策略，实现了高效的数据处理和低通信开销，为大规模数据集的频繁项集挖掘提供了新的可能。这对于提升数据挖掘任务的执行效率和处理能力，特别是在云计算和分布式计算环境下，具有重要的理论和实际价值。未来的研究可以进一步优化算法的并行度和资源调度，以适应更复杂的实时数据分析需求。

HPFP-Miner: A Novel Parallel Frequent Itemset Mining Algorithm

CHEN Xiaoyun

, HE Yanshan

, CHEN Pengfei

, MIAO Shengfa

, SONG Weiguo

, YUE Min

1.School of Information Science and Engineering

Lanzhou University

Lanzhou Gansu, PRC

e-mail: heysh07@lzu.cn

Abstract—Frequent itemset mining is a fundamental and

essential issue in data mining field and can be used in many

data mining tasks. Most of these mining tasks require multiple

passes over the database and if the database size is large,

which is usually the case, scalable high performance solutions

involving multiple processors are required. In this paper, we

present a novel parallel frequent itemset mining algorithm

which is called HPFP-Miner. The proposed algorithm is based

on FP-Growth and introduces little communication overheads

by efficiently partitioning the list of frequent elements list over

processors. The results of experiment show that HPFP-Miner

has good scalability and performance.

Keywords-frequent itemset; data mining; parallel; HPFP-

Miner; FP-Growth

I. INTRODUCTION

Frequent itemset mining is a fundamental and essential

issue in data mining field and can be used in many data

mining tasks which include association rules analysis,

correlations analysis, classification, clustering, outlier

analysis, etc. Since 1993 Agrawal et al.[1] proposed

association rules analysis problem through research of the

relationship among itemsets in custom transaction database

and proved the core of association rules analysis problem

was mining the frequent itemsets which could generate

strong association rules, the issue of mining frequent

itemsets has been an active research topic.

However, the huge search space makes frequent pattern

mining on large databases a very computationally intensive

problem. The performance of a serial frequent pattern

mining algorithm is limited by its single processor

computing capability and finite memory space; thus, they do

not scale to large datasets. Therefore, it is very necessary to

exploit high performance parallel/distributed computing to

relieve the current frequent pattern mining algorithms from

the sequential bottleneck, and helps the FPM algorithms

scale to massive datasets[2].

FP-Growth[3] algorithm proposed by J.Han in 2000 is

one of the most popular frequent itemset mining algorithm.

It adopts divide-and-conquer method which compresses

transaction database to a frequent pattern tree—FP-tree first,

then mines complete frequent itemsets from FP-tree. The

intrinsic divide-and-conquer nature of FP-growth can easily

be parallelized. In 2001, Osmar R. Zaïane et al. proposed

MLFPT[4] which is a simple parallel algorithm based on

FP-growth for mining frequent itemsets on shared memory

machines. The key idea behind this task parallel paradigm is

the divide-and-conquer nature of the FP-growth algorithm.

Later, Javed and Khokhar et al. proposed another FP-

growth-based parallel algorithm called PFP-tree[5] but

running on a distributed memory architecture. The FP-tree

building process is similar to that in MLFPT except that

multiple FP-trees reside in different physical memory spaces.

The mining task is also divided in the same way as in

MLFPT. Another difference is that the conditional pattern

base of each 1-itemset must be exchanged among processors

so that each can hold the needed portion of the database

locally.

In this paper, we present a novel parallel frequent itemset

mining algorithm, called HPFP-Miner, to mine the complete

set of frequent itemsets on parallel distributed memory

architecture. We implement HPFP-Miner on four real world

datasets, and compare the results with two famous FP-

Growth-like algorithms. The experimental results show that

our algorithm has a better scalability and performance.

II.

ELATED WORK

For the sake of completeness, we give the relative

problem description of frequent itemset mining and FP-

Growth in this section.

Definition 1: Let I={i

,ĂĂ,i

} be a set of items,

where an itemset X is a subset of I such that

I

. An

itemset of length k is called k-itemset. Let D be a set of

transactions, where each transaction T is a set of items such

that . A unique identified TID is given to each

transaction. The support of an itemset X is the number of

transactions in D that contain X. If the support of X is

greater than or equal to a predefined support threshold

TD

, it

is called a frequent itemset or frequent pattern, otherwise, it

is infrequent.

The frequent pattern mining problem is to discover the

complete set of all patterns contained in at least a specified

support threshold

, of transactions in the transaction

database.

FP-Growth-like algorithms adopt divide-and-conquer

method which can be stated as follow: First, it compresses

the database representing frequent items into a frequent-

pattern tree, or FP-tree, which retains the itemset association

information. It then divides the compressed database into a

set of conditional databases (a special kind of projected

database), each associated with one frequent item or “pattern

fragment,” and mines each such database separately[6].

FP-Growth opened up a new way to efficiently mine

frequent pattern. However, its low time and space utilization