An Improved Matrix Sorting Index Association Rule Data Mining
Algorithm
ZHOU Zhiping, WANG Jiefeng
School of Internet of Things Engineering, Jiangnan University, Wuxi 214122
E-mail: zzp@jiangnan.edu.cn, 18352513420@163.com
Abstract: Existing Apriori-based association rule mining algorithms must scan the database many times and generate a large
number of candidate sets, which incurs heavy I/O cost and results in low mining efficiency. Matrix algorithms
can speed up the computation of frequent 2-itemsets, but they do not delete non-frequent itemsets before the calculation, so the
efficiency gain is limited. This paper proposes an association rule algorithm based on a matrix and a sorting index. First, unwanted
transactions and items are deleted; the frequent 2-itemsets are then obtained by matrix multiplication and table lookup, and the remaining
frequent k-itemsets are derived with the help of the sorting index. Compared with the Apriori algorithm and the matrix algorithm, the
proposed algorithm scans the database only once and finds the frequent k-itemsets directly. It is particularly efficient and feasible
when the frequent itemsets are of higher order or when the mining results need to be updated. Experiments show that the proposed
matrix sorting index algorithm greatly improves data mining efficiency and scalability.
Key Words: Data mining, association rules, Apriori algorithm, matrix algorithms, sorting index
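The abstract's first two steps (deleting unwanted items and transactions, then obtaining the frequent 2-itemsets by matrix multiplication and lookup) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `frequent_pairs`, the sample transactions and the threshold `min_support=3` are assumptions made for illustration, and the sorting index step is not shown.

```python
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return {(a, b): count} for item pairs occurring together >= min_support times."""
    # Count the support of every single item (the data is held in memory here).
    item_count = {}
    for t in transactions:
        for item in set(t):
            item_count[item] = item_count.get(item, 0) + 1

    # Delete items below min_support, then delete transactions left with fewer
    # than two items, since they cannot contribute to any 2-itemset.
    items = sorted(i for i, c in item_count.items() if c >= min_support)
    index = {item: j for j, item in enumerate(items)}
    rows = []
    for t in transactions:
        kept = {index[i] for i in t if i in index}
        if len(kept) >= 2:
            rows.append([1 if j in kept else 0 for j in range(len(items))])

    # C = D^T * D: entry C[a][b] is the number of rows containing both a and b.
    n = len(items)
    cooc = [[sum(r[a] * r[b] for r in rows) for b in range(n)] for a in range(n)]

    # Look up the upper triangle and keep the pairs that reach min_support.
    return {
        (items[a], items[b]): cooc[a][b]
        for a, b in combinations(range(n), 2)
        if cooc[a][b] >= min_support
    }

# Hypothetical sample data, for illustration only.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]
pairs = frequent_pairs(transactions, min_support=3)
```

With this sample data, the items eggs and cola are deleted before the multiplication, so the product matrix is 4 x 4 instead of 6 x 6. This is the kind of saving the pre-deletion step is meant to provide.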
1 Introduction
With the continuous development of the Internet, the
information-based economy and information technology,
data mining is now widely used in many areas, including
building energy consumption, the installment industry, etc. Data
mining is the discovery of previously unknown, potentially
valuable information or patterns from vast amounts of
existing data. When the concept was proposed, it immediately
aroused widespread interest in the scientific community.
Association rule mining is an important research direction
of data mining. It focuses on establishing links between
different fields of the database in order to identify
relationships among multiple fields that satisfy a given
support and confidence [1]. The method was proposed by
R. Agrawal et al., originally for market basket analysis, which
aims at mining association knowledge about customer purchase
behavior from transaction datasets to guide commercial sales,
for example by helping merchants arrange purchasing,
inventory and shelf design scientifically. Because data arise in
diverse situations, effective data mining methods need to be
explored to obtain useful information from massive data and to
provide support for intelligent human-computer interaction
[2, 3]. With the development of data mining association
rules, they have been successfully applied in other areas such as
financial securities analysis, telecommunications and
banking risk early warning, which shows their great potential
for development and application. Many scholars
have studied this topic extensively and made substantial
contributions to the development of data mining. Most
improvements to traditional association rule mining are
based on the Apriori algorithm. The biggest flaw of the
Apriori algorithm is that it must repeatedly scan the
database, which reduces mining efficiency. Although it has
since been improved in many ways, its efficiency is still not
very high [4, 5].
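The repeated scanning mentioned above can be seen in a minimal sketch of the classical level-wise Apriori procedure. The function name `apriori`, the `scans` counter and the sample data are assumptions introduced for illustration; the sketch is a textbook formulation, not code from any of the cited works.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori; returns (frequent itemsets with counts, counting passes)."""
    data = [frozenset(t) for t in transactions]
    frequent = {}
    scans = 0
    # Level-1 candidates: every single item (gathered up front for simplicity).
    candidates = {frozenset([i]) for t in data for i in t}
    k = 1
    while candidates:
        # One full pass over the database per level: this is the repeated scan.
        scans += 1
        counts = {c: sum(1 for t in data if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent, scans

# Hypothetical sample data, for illustration only.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]
frequent, scans = apriori(transactions, min_support=3)
```

On this sample the procedure makes three counting passes: one each for the 1-, 2- and 3-item candidates. The third pass is spent on a single candidate that turns out not to be frequent. The number of passes grows with the size of the largest candidate, which is the cost the improved algorithms aim to avoid.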
* This work was supported by the National Natural Science Foundation
(61373126). The research was supported by Jiangsu Province Prospective
Joint Research Project Foundation (BY2013015-33).
P.-G. Cheng et al. proposed the NFUP algorithm, which, based
on the concept of strong large itemsets, joins strong large
itemsets into a small number of candidate itemsets and adopts
an early pruning strategy to cut down the number of database
scans [6]. G. Peng et al., working with a customer relationship
management system, introduced an improved Apriori
association rule mining algorithm that deletes many
invalid transactions, reducing the records for subsequent scans
and raising the efficiency of data mining. At the same time,
as transactions are removed, the scale of the database
decreases. Consequently, scanning time is saved and
processing efficiency is enhanced [7]. X. Lv et al. focused on
the large number of candidate itemsets and the time spent
scanning the database, and proposed an efficient
algorithm, WARDM, for mining the candidate itemsets to
overcome these problems; it reduces the number of
candidate itemsets and improves execution efficiency [8].
Y.-L. Chen et al. first identified the various data types that
may appear in a questionnaire and then proposed a unified
approach based on fuzzy techniques, so that all the different
data types can be handled in a uniform manner [9]. S.-L. Zhang aimed
at the performance bottlenecks of the Apriori algorithm, namely
scanning the database multiple times and generating a large
number of candidate itemsets, and proposed a new algorithm
that filters out transactions unrelated to the mining
targets with a preset filter, greatly improving the overall
performance of the algorithm [10]. Based on rough set (RS)
theory and association rule mining algorithms, Z. Kun et al.
computed the core and an attribute reduction with an algorithm
based on the discernibility matrix, and then put forward a
mining model of association rules with decision attributes
based on the Apriori, AprioriTid and AprioriHybrid
algorithms [11]. B. L. Wang et al. proposed a new Apriori
algorithm based on a Boolean matrix. It scans the transaction
database only once, thus reducing the system cost and
increasing the efficiency of data mining [12].
P. Haiwei et al. proposed a new algorithm, GMA, which is based
on an association graph and matrix pruning to reduce the
number of candidate itemsets. Experimental results show
that it is more efficient for different values of minimum
support [13].
S. Anekritmongkol et al. proposed a new
algorithm based on a Boolean algebra compression technique for
association rule data mining (B-Compress), which adopts
three major ideas: compressing the data, greatly reducing the
number of database scans, and reducing the file size [14, 15].
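Several of the approaches surveyed above rely on a Boolean matrix representation so that the database is read only once. The following is a generic sketch of that idea and does not reproduce any of the cited algorithms. Each item's column is stored as an integer bitmask, and the support of an itemset is the number of set bits in the AND of its columns. The names `build_columns` and `support` and the sample data are assumptions introduced for illustration.

```python
def build_columns(transactions):
    """One pass over the transactions: map each item to a bitmask of row indices."""
    columns = {}
    for row, t in enumerate(transactions):
        for item in set(t):
            columns[item] = columns.get(item, 0) | (1 << row)
    return columns

def support(columns, itemset):
    """Support count of an itemset: AND the item columns and count the set bits."""
    items = list(itemset)
    if not items:
        raise ValueError("itemset must not be empty")
    mask = columns.get(items[0], 0)
    for item in items[1:]:
        mask &= columns.get(item, 0)
    return bin(mask).count("1")

# Hypothetical sample data, for illustration only.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]
columns = build_columns(transactions)
pair_support = support(columns, ["bread", "milk"])
triple_support = support(columns, ["bread", "diapers", "milk"])
```

Once the columns are built, the support of any candidate itemset can be computed from the bitmasks alone, without returning to the original transaction records.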
Proceedings of the 33rd Chinese Control Conference
July 28-30, 2014, Nanjing, China