A Two-Phase Approach to Mine Short-Period High-Utility Itemsets in Transactional Databases


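This article discusses a two-phase method for discovering short-period high-utility itemsets (SPHUIs) in transactional databases. The authors, Jerry Chun-Wei Lin, Jiexiong Zhang and colleagues, are affiliated with the School of Computer Science and Technology and the School of Natural Sciences and Humanities of Harbin Institute of Technology Shenzhen Graduate School (Shenzhen, China), with the National University of Kaohsiung and the Department of Computer Science and Engineering of National Sun Yat-sen University (Kaohsiung, Taiwan), and with the School of Agricultural, Computational and Environmental Sciences of the University of Southern Queensland (Australia).

In recent years, high-utility itemset mining has attracted growing attention in data mining, particularly in business intelligence and market basket analysis, where high-utility itemsets reveal customers' purchasing patterns and preferences and help in designing more effective marketing strategies. Compared with traditional high-utility itemsets (HUIs), however, short-period high-utility itemsets are harder to discover because the data are time-sensitive and change dynamically. Traditional single-phase algorithms may not meet this need for real-time or periodic updates, so a two-phase method is proposed.

The first phase is a preprocessing phase. It scans the transactional database quickly and applies pruning and filtering strategies to reduce the cost of later processing, identifying candidate itemsets with potentially high utility. This step relies on heuristics, such as threshold-based rules, to shrink the search space and improve efficiency. The second phase is a refinement and confirmation phase. For the candidates retained by the first phase, it applies more precise computation, such as iterative or recursive algorithms, to ensure that only true short-period high-utility itemsets are kept. This step may involve incremental computation or online learning to adapt to a changing data stream.

The keywords of the article include "high-utility itemsets", "periodic high-utility itemsets", "SPHUIs" and "two-phase", which reflect the core content and focus of the paper. The manuscript was received on 3 May 2016, resubmitted after revision on 9 February 2017, and accepted on 29 April of the same year. The study has theoretical and practical value for understanding and improving the mining of short-period high-utility itemsets in transactional databases, and it offers data analysts and system developers new solutions and optimization ideas.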

and the sale periods of itemsets. Mining SPHUIs provides the benefit of revealing the most popular itemsets in each short period that also have high utilities. This type of pattern is more interesting than HUIs for real-life applications such as market basket analysis.

2. An efficient two-phase short-period high-utility itemset mining (SPHUI) algorithm is presented to discover SPHUIs in a level-wise manner. In its first phase, the algorithm utilizes an upper bound on the utilities of itemsets to find a small set of candidate SPHUIs. Then, in the second phase, the algorithm identifies the actual SPHUIs in this set of candidates.

3. To speed up the baseline SPHUI algorithm, two pruning strategies are further introduced. The resulting algorithms are respectively named SPHUI and SPHUI_TID. The two strategies are designed to reduce the search space for mining SPHUIs.

4. Experiments on various real-life and synthetic datasets are then conducted to evaluate the performance of the three designed approaches in terms of runtime, memory usage, number of candidates, number of patterns found, and scalability.

The rest of this paper is organized as follows. Related work is reviewed in Section 2. Section 3 introduces preliminaries and defines the problem of SPHUIM. Section 4 provides details about the proposed algorithm and presents the two pruning strategies. A detailed example of how the proposed algorithm is applied on a small database is presented in Section 5. Results of an extensive experimental evaluation are discussed in Section 6. Finally, Section 7 provides a conclusion and a discussion.

2. Related work

In data mining, the tasks of association-rule mining (ARM) and frequent itemset mining (FIM) are considered fundamental, and have thus attracted the interest of numerous researchers [2,13,14]. Agrawal et al. [1] first proposed the Apriori algorithm for discovering association rules in a level-wise manner. The Apriori algorithm first generates a large number of candidate itemsets and retains the frequent itemsets (FIs), that is, those that satisfy a minimum support threshold. Then, these itemsets are used to reveal the association rules that respect a minimum confidence threshold. This process allows discovering all association rules that meet the minimum support and confidence thresholds set by the user. However, this process is very time-consuming and can result in long execution times and the consumption of a large amount of memory. To address this issue, Han et al. designed a pattern-growth algorithm called FP-Growth [2], which builds a compressed tree structure using frequent items, and then derives all FIs from the constructed tree structure without generating candidates. Thereafter, several other algorithms have been proposed for FIM and HUIM, but most of them are based on either the level-wise or the pattern-growth approach. In general, pattern-growth methods perform better than level-wise approaches but are more difficult to implement. Researchers have also proposed numerous variations of the FIM problem such as sequential pattern mining [15], top-k frequent pattern mining [14], frequent closed itemset mining [3], maximal frequent itemset mining [13] and high utility sequential pattern mining [16,17].
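To make the level-wise strategy described above concrete, the following is a minimal Python sketch of frequent itemset mining in the Apriori style. It is an illustration under stated assumptions, not the implementation from [1]: the function name `apriori`, the use of absolute support counts, and the toy transactions in the usage note are all hypothetical, and rule generation with a confidence threshold is left out.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets.

    Illustrative sketch only: min_support is an absolute count.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items in one database scan.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(level)

    k = 2
    while level:
        # Join step: merge frequent (k-1)-itemsets into k-itemset candidates.
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(s) in level for s in combinations(union, k - 1)):
                candidates.add(union)

        # Count step: one database scan per level.
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t)
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        k += 1
    return frequent
```

For example, calling `apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}], 3)` keeps the three single items and the three pairs, and discards the candidate {a, b, c} because it appears in only two transactions. The repeated database scans in the count step are the cost that the text attributes to level-wise methods.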

HUIM can be considered as an extension of ARM. It has been extensively studied in recent decades [18–20] since it has numerous applications in various fields of science and engineering [10]. A typical application of HUIM is market basket analysis. In this application, HUIM provides crucial information to managers and decision makers for designing profitable sale strategies or taking other strategic decisions. In HUIM, the purchase quantities in transactions and the unit profits of items are considered to find the itemsets that are highly profitable, called high-utility itemsets (HUIs). These itemsets are useful for managers and retailers as they are the patterns that yield a high profit. To extract the set of high-utility itemsets from a database, several algorithms have been designed. Chan et al. [4] presented a framework to mine the top-k closed utility patterns based on business objectives. Their approach discovers not only frequent itemsets (FIs) but also HUIs. Yao et al. [21] then defined utility mining as the problem of discovering profitable itemsets in transactional databases by considering both the purchase quantities of items in transactions and their unit profits. A key difference between the tasks of FIM and HUIM is that the downward closure property, also called the Apriori property, does not hold in HUIM [21]. The Apriori property indicates that the occurrence frequency of an itemset cannot be less than that of its supersets. This property is widely used in FIM to reduce the number of candidate FIs.
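The contrast between support and utility can be checked on a small example. The sketch below is illustrative only: the unit profits, the four transactions and the function names `support` and `utility` are hypothetical and do not come from the paper. Utility is computed as described above, by summing quantity times unit profit over the transactions that contain the whole itemset.

```python
# Hypothetical toy data: unit profit per item, and purchase quantities
# per transaction (item -> quantity).
UNIT_PROFIT = {"a": 5, "b": 1, "c": 2}
DATABASE = [
    {"a": 1, "b": 2},          # T1
    {"a": 2, "c": 1},          # T2
    {"b": 4, "c": 3},          # T3
    {"a": 1, "b": 1, "c": 1},  # T4
]


def support(itemset, database):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in database if all(i in t for i in itemset))


def utility(itemset, database, unit_profit):
    """Sum of quantity * unit profit of the itemset's items, taken over
    the transactions that contain the whole itemset."""
    total = 0
    for t in database:
        if all(i in t for i in itemset):
            total += sum(t[i] * unit_profit[i] for i in itemset)
    return total
```

On this data, the supports of {b}, {a, b} and {a, b, c} are 3, 2 and 1, so support never increases when an item is added. Their utilities are 7, 13 and 8: the utility rises and then falls along the same chain, which is why utility alone cannot be used to prune supersets.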

In HUIM, this property does not hold since the utility of an itemset can be less than, greater than, or equal to the utility of any of its supersets. To obtain a downward closure property in HUIM and be able to reduce the search space for mining HUIs, Liu et al. [22] introduced a model called the transaction-weighted utilization (TWU). This model computes an upper bound on the utility of itemsets that is downward-closed, a property called the transaction-weighted downward closure (TWDC) property. A two-phase algorithm was also designed to extract HUIs in transactional databases. The first phase consists of mining the high transaction-weighted utilization itemsets (HTWUIs) in a level-wise manner. Then, the second phase consists of identifying the HUIs among the HTWUIs. The aforementioned studies utilize a generate-and-test approach where itemsets are discovered level by level. This strategy requires generating a large number of candidates. Thus, these algorithms have high time and space complexities.
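The TWU model and the two-phase scheme described above can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm of [22] as published: the function names, the toy database and the threshold used in the usage note are hypothetical. The TWU of an itemset is taken here as the sum of the utilities of the transactions that contain it, which is never smaller than the itemset's own utility.

```python
from itertools import combinations

# Hypothetical toy data (item -> quantity per transaction, and unit profits).
UNIT_PROFIT = {"a": 5, "b": 1, "c": 2}
DATABASE = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"b": 4, "c": 3},
    {"a": 1, "b": 1, "c": 1},
]


def contains(t, itemset):
    return all(i in t for i in itemset)


def utility(itemset, database, unit_profit):
    """Exact utility of the itemset over the transactions containing it."""
    return sum(sum(t[i] * unit_profit[i] for i in itemset)
               for t in database if contains(t, itemset))


def twu(itemset, database, unit_profit):
    """Sum of whole-transaction utilities over transactions containing
    the itemset: a downward-closed upper bound on its utility."""
    return sum(sum(q * unit_profit[i] for i, q in t.items())
               for t in database if contains(t, itemset))


def two_phase(database, unit_profit, min_util):
    # Phase 1: level-wise search for itemsets whose TWU reaches min_util.
    items = sorted({i for t in database for i in t})
    level = [frozenset([i]) for i in items
             if twu([i], database, unit_profit) >= min_util]
    htwuis = list(level)
    k = 2
    while level:
        known = set(level)
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(s) in known for s in combinations(c, k - 1))
                 and twu(c, database, unit_profit) >= min_util]
        htwuis.extend(level)
        k += 1

    # Phase 2: compute exact utilities and keep the true HUIs.
    exact = {c: utility(c, database, unit_profit) for c in htwuis}
    return htwuis, {c: u for c, u in exact.items() if u >= min_util}
```

With `min_util = 13`, phase 1 retains six candidates on this data ({a}, {b}, {c} and the three pairs) and rejects {a, b, c}, whose TWU is 8. Phase 2 then keeps {a}, {a, b}, {a, c} and {b, c} with utilities 20, 13, 19 and 13. Note that {b, c} is a HUI although neither {b} nor {c} is, which is only found because pruning uses the TWU bound rather than the utility itself.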

Ahmed et al. [23] then designed a tree-based structure named HUP-tree and a corresponding algorithm to mine HUIs. The algorithm utilizes the TWU model to find 1-HTWUIs (HTWUIs containing only one item) and keeps information in its tree structure to support incremental and interactive mining. Lin et al. [24] then presented the high-utility pattern (HUP)-growth algorithm for mining HUIs. This algorithm is based on the TWU model and a novel HUP-tree structure. The tree structure maintains information about the purchase quantities of items in transactions to improve the mining performance and discover HUIs without candidate generation. Tseng et al. proposed the UP-growth [25] and UP-growth+ [10] algorithms and several pruning strategies to mine the set of HUIs based on their designed UP-tree structure. Nevertheless, the above approaches have high time and space complexity, and the search space in terms of number of itemsets considered by these algorithms is very large.

To speed up the discovery of HUIs, Lan et al. [26] developed an efficient projection-based indexing approach and a pruning strategy to reduce the number of candidate itemsets. Liu et al. [27] proposed a novel utility-list structure to keep the information required for mining HUIs. They developed an algorithm named HUI-Miner to directly discover HUIs without candidate generation. The utility-list structure stores important information for identifying HUIs: (1) the IDs of transactions where an itemset appears; (2) the actual utilities of the itemset in these transactions; and (3) the utility of items that could be appended to the itemset in these transactions. Fournier-Viger et al. [18] further designed an Estimated Utility Co-occurrence Structure (EUCS) to store information about the relationships between 2-itemsets and proposed the FHM algorithm to mine HUIs based on the utility-list structure. By relying on the EUCS, 2-itemsets can be easily discovered and unpromising candidates can also be easily pruned. The development of HUIM algorithms is a very active research area, and novel

J.C.-W. Lin et al. / Advanced Engineering Informatics 33 (2017) 29–43
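The three fields of the utility-list structure mentioned in the related work can be illustrated for single items. The sketch below is a hypothetical illustration, not code from HUI-Miner [27]: the function name, the field order (tid, iutil, rutil), the toy data and the chosen item order are all assumptions. "Items that could be appended" is interpreted here as the items that come later in a fixed total order on items.

```python
# Hypothetical toy data (item -> quantity per transaction, and unit profits).
UNIT_PROFIT = {"a": 5, "b": 1, "c": 2}
DATABASE = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"b": 4, "c": 3},
    {"a": 1, "b": 1, "c": 1},
]


def build_utility_lists(database, unit_profit, order):
    """Map each item to a list of (tid, iutil, rutil) entries.

    tid   : 1-based ID of a transaction containing the item
    iutil : utility of the item in that transaction
    rutil : utility of the items of that transaction that follow the
            item in `order` (those that could still be appended)
    """
    rank = {item: r for r, item in enumerate(order)}
    lists = {item: [] for item in order}
    for tid, t in enumerate(database, start=1):
        present = sorted(t, key=rank.__getitem__)
        remaining = sum(t[i] * unit_profit[i] for i in present)
        for item in present:
            iutil = t[item] * unit_profit[item]
            remaining -= iutil
            lists[item].append((tid, iutil, remaining))
    return lists
```

With the assumed order `["b", "a", "c"]`, the list of `b` is `[(1, 2, 5), (3, 4, 6), (4, 1, 7)]`. Summing the second field gives the utility of {b}, which is 7, without rescanning the database, and the third field records how much utility the later items in each transaction could still contribute to an extension of {b}.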
