Association Rule Mining Algorithm Optimized with a Boolean Matrix and Sorting Index: Efficient Frequent Itemset Mining


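This paper presents an improved association rule mining algorithm that combines a Boolean matrix with a sorting index. The traditional Apriori algorithm has a significant drawback when mining association rules: it scans the database repeatedly to generate candidate sets, which causes heavy I/O and lowers mining efficiency. Boolean-matrix association rule algorithms compute frequent itemsets quickly and with little memory, but they do not preprocess the matrix before computation, which adds some computational cost. To address these shortcomings, the authors propose a new method. First, the algorithm preprocesses the Boolean matrix to remove invalid transactions and items, then obtains the frequent 2-itemsets through matrix multiplication and a search table. This reduces the computation spent on non-frequent itemsets. Second, it introduces a sorting index that uses index numbers and a jump-search mechanism to locate frequent itemsets quickly, which further speeds up retrieval. Once the frequent 2-itemsets are available, higher-order frequent k-itemsets are generated directly with the help of the sorting index, without rescanning the database, which reduces time and space costs. Compared with Apriori and with a plain matrix algorithm, the new method has a clear advantage when itemsets are highly frequent or very numerous. Experiments show that the proposed algorithm needs only a single database scan to generate all frequent itemsets, giving better computational performance and resource use in practice. Overall, the study combines the efficient computation of Boolean matrices with the fast retrieval of a sorting index, offering a more efficient and resource-saving approach to association rule mining, which matters for improving the timeliness and accuracy of data mining in big-data settings.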
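As a rough illustration of the preprocessing and matrix-multiplication steps described above, the following Python sketch builds a transaction-by-item Boolean matrix, prunes infrequent items and too-short transactions, and reads frequent 2-itemsets from the pairwise column products. The function name `frequent_pairs_boolean_matrix`, the row/column orientation, the rule for dropping rows with fewer than two items, and the use of an absolute support count are assumptions made for illustration. The search table and sorting index are not shown because this excerpt does not specify them.

```python
def frequent_pairs_boolean_matrix(transactions, min_support_count):
    """Return {(item_a, item_b): support_count} for frequent 2-itemsets.

    Hypothetical helper: a sketch of the idea, not the paper's implementation.
    """
    # Single pass over the data: build a transaction-by-item Boolean matrix.
    items = sorted({item for t in transactions for item in t})
    matrix = [[1 if item in t else 0 for item in items] for t in transactions]

    # Preprocessing 1: drop item columns whose support is below the threshold.
    col_support = [sum(row[j] for row in matrix) for j in range(len(items))]
    keep = [j for j, s in enumerate(col_support) if s >= min_support_count]
    items = [items[j] for j in keep]
    matrix = [[row[j] for j in keep] for row in matrix]

    # Preprocessing 2 (assumed rule): drop transaction rows with fewer than
    # two remaining items, since they cannot support any 2-itemset.
    matrix = [row for row in matrix if sum(row) >= 2]

    # Entry (i, j) of M^T M counts the transactions containing both items.
    # Only the upper triangle is needed because the product is symmetric.
    pairs = {}
    n = len(items)
    for i in range(n):
        for j in range(i + 1, n):
            count = sum(row[i] * row[j] for row in matrix)
            if count >= min_support_count:
                pairs[(items[i], items[j])] = count
    return pairs


if __name__ == "__main__":
    demo = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}, {"e"}]
    print(frequent_pairs_boolean_matrix(demo, min_support_count=2))
```

After the matrix is built, the raw transactions are never read again; all later work happens on the pruned matrix, which is the property the summary attributes to the single-scan design.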

An Improved Matrix Sorting Index Association Rule Data Mining Algorithm

ZHOU Zhiping, WANG Jiefeng

School of Internet of Things Engineering, Jiangnan University, Wuxi 214122

E-mail: zzp@jiangnan.edu.cn, 18352513420@163.com

Abstract: Existing Apriori-based association rule mining algorithms must scan the database many times and generate large numbers of candidate sets, which incurs heavy I/O cost and results in low computational efficiency. Matrix algorithms can compute frequent 2-itemsets more efficiently, but they do not delete non-frequent itemsets before the calculation, so the gain in efficiency is limited. This paper proposes an association rule algorithm based on a matrix and a sorting index. First, unneeded transactions and items are deleted; the frequent 2-itemsets are then obtained by matrix multiplication and a search table, and the remaining frequent k-itemsets are derived in combination with the sorting index. Compared with the Apriori algorithm and the matrix algorithm, the proposed algorithm scans the database only once and can find the frequent k-itemsets directly. It is especially efficient and feasible when the frequent itemsets are higher-order or when the mining results need to be updated. Experiments show that the proposed matrix sorting index algorithm greatly improves data mining efficiency and scalability.

Key Words: Data mining, association rules, Apriori algorithm, matrix algorithms, sorting index



1 Introduction

With the continuous development of the Internet, the information-based economy, and information technology, data mining is now widely used in many areas, including building energy consumption and the installment industry. Data mining is the discovery of previously unknown, potentially valuable information or patterns in vast amounts of existing data. When the concept was proposed, it immediately aroused widespread interest in the scientific community. Association rule mining is an important research direction in data mining. It focuses on establishing links between different fields of a database in order to identify relationships among multiple fields that satisfy a given support and confidence [1]. R. Agrawal et al. proposed the method, originally for market basket analysis: it aims to extract knowledge about associations in customer purchase behavior from transaction datasets to guide commercial sales, including helping merchants plan purchasing, inventory, and shelf design on a sound basis. Because data arise in diverse situations, effective data mining methods are needed to obtain useful information from massive data and to support intelligent human-computer interaction [2, 3]. As association rule mining has developed, it has been applied successfully in other areas such as financial securities analysis, telecommunications, and banking risk early warning, which shows its great potential and application prospects. Many scholars have done extensive research on this topic and have contributed greatly to the development of data mining. Most improvements to traditional association rule mining are based on the Apriori algorithm. The biggest flaw of the Apriori algorithm is that it must scan the database repeatedly, which hurts mining efficiency. Although it has since been improved in many ways, its efficiency is still not very high [4, 5].
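To make the repeated-scan cost concrete, here is a minimal level-wise Apriori sketch in Python that counts one database pass per itemset size. It is a textbook-style illustration, not the implementation evaluated in this paper; the function name `apriori_with_scan_count` and the absolute support threshold are assumptions.

```python
from itertools import combinations


def apriori_with_scan_count(transactions, min_support_count):
    """Return (frequent itemsets with counts, number of counting passes)."""
    transactions = [frozenset(t) for t in transactions]
    scans = 0
    frequent = {}
    # Level-1 candidates: every distinct item. Collecting the item universe
    # is not counted as a pass here; only the counting loops are.
    candidates = {frozenset([item]) for t in transactions for item in t}
    k = 1
    while candidates:
        # One full pass over the database per level: the repeated scan.
        scans += 1
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(level)
        # Join step: unions of two frequent k-itemsets with k + 1 items.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k - 1)-subset of a candidate must be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent, scans


if __name__ == "__main__":
    demo = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    itemsets, passes = apriori_with_scan_count(demo, min_support_count=3)
    print(len(itemsets), "frequent itemsets found in", passes, "passes")
```

The number of passes grows with the size of the largest candidate itemset, which is why the approaches surveyed next concentrate on cutting scans or shrinking the data read per scan.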

P.-G. Cheng et al. proposed the NFUP algorithm, which, based on the concept of strong large itemsets, joins strong large itemsets into a small number of candidate itemsets and adopts an early pruning strategy to cut down the number of database scans [6]. G. Peng et al., working with a customer relationship management system, introduced an improved Apriori association rule algorithm that deletes many invalid transactions and so reduces the records to be read in subsequent scans, which raises mining efficiency. As transactions are removed, the database shrinks; consequently, scanning time is saved and processing efficiency improves [7]. X. Lv et al. focused on the large number of candidate itemsets and the time spent scanning the database, and proposed an efficient algorithm, WARDM, for mining the candidate itemsets, which reduces the number of candidate itemsets and improves execution efficiency [8]. Y.-L. Chen et al. first identified the various data types that may appear in a questionnaire and then proposed a unified approach based on fuzzy techniques, so that all the different data types can be handled in a uniform manner [9]. S.-L. Zhang targeted the performance bottlenecks of the Apriori algorithm, namely scanning the database multiple times and generating a large quantity of candidate itemsets, and proposed a new algorithm that filters out transactions unrelated to the mining targets with a preset filter, greatly improving overall performance [10]. Z. Kun et al., drawing on RS theory and association rule mining algorithms, computed the core and gave an attribute reduction algorithm based on the discrepancy matrix, and then put forward a mining model for association rules with decision attributes based on the Apriori, AprioriTid, and AprioriHybrid algorithms [11]. B. L. Wang et al. proposed a new Apriori algorithm based on a Boolean matrix; it scans the transaction database only once, which reduces system cost and increases mining efficiency [12]. P. Haiwei et al. proposed a new algorithm, GMA, which is based on an association graph and matrix pruning to reduce the number of candidate itemsets. Experimental results show that it is more efficient for different values of minimum support [13]. S. Anekritmongkol et al. proposed a new algorithm based on a Boolean algebra compression technique for association rule mining (B-Compress), which adopts three major ideas: compress the data, greatly reduce the number of database scans, and reduce file size [14, 15].

This work was supported by the National Natural Science Foundation (61373126). The research was supported by Jiangsu Province Prospective Joint Research Project Foundation (BY2013015-33).

Proceedings of the 33rd Chinese Control Conference, July 28-30, 2014, Nanjing, China
