PrefixSpan: A New Method for Efficiently Mining Sequential Patterns

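"PrefixSpan: a method for mining sequential patterns efficiently by prefix projection."

In data mining, sequential pattern mining is an important task with broad applications. It aims to discover time- or order-related patterns in data such as transaction records, user behavior logs, and network traffic. The problem is challenging because the number of possible sequential patterns grows combinatorially. Most earlier algorithms follow an Apriori-like, level-wise approach: they generate candidate subsequences pass by pass and discard those whose support falls below the threshold. This reduces the number of combinations that must be examined, but it remains inefficient when the database is large or when the patterns to be mined are numerous or long.

To address this, the paper proposes a new sequential pattern mining algorithm, PrefixSpan (Prefix-projected Sequential pattern mining). The algorithm is built on prefix projection, which avoids enumerating all candidate combinations. PrefixSpan first takes a single frequent item as a prefix, finds all sequences that contain it, and projects them into a new, smaller database that holds only the suffixes following the prefix. The process is applied recursively, extending the prefix by one frequent item at a time, and stops when a projected database contains no item that meets the minimum support. Because each projected database holds only the suffixes relevant to its prefix, the data to be scanned shrinks at every step.

According to the paper's performance study, PrefixSpan outperforms both the Apriori-based GSP algorithm and FreeSpan in mining large sequence databases. In short, PrefixSpan uses prefix projection and progressively smaller projected databases to address the efficiency problems of mining large databases and long sequential patterns. It is of practical value in fields that need fast sequential pattern discovery, such as market analysis, network monitoring, and bioinformatics.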

PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth



Jian Pei Jiawei Han Behzad Mortazavi-Asl Helen Pinto

Intelligent Database Systems Research Lab. School of Computing Science, Simon Fraser University

Burnaby, B.C., Canada V5A 1S6

E-mail: {peijian, han, mortazav, hlpinto}@cs.sfu.ca

Qiming Chen Umeshwar Dayal Mei-Chun Hsu

Hewlett-Packard Labs. Palo Alto, California 94303-0969 U.S.A.

E-mail: {qchen, dayal, mchsu}@hpl.hp.com

Abstract

Sequential pattern mining is an important data mining problem with broad applications. It is challenging since one may need to examine a combinatorially explosive number of possible subsequence patterns. Most of the previously developed sequential pattern mining methods follow the methodology of Apriori, which may substantially reduce the number of combinations to be examined. However, Apriori still encounters problems when a sequence database is large and/or when sequential patterns to be mined are numerous and/or long.

In this paper, we propose a novel sequential pattern mining method, called PrefixSpan (i.e., Prefix-projected Sequential pattern mining), which explores prefix-projection in sequential pattern mining. PrefixSpan mines the complete set of patterns but greatly reduces the efforts of candidate subsequence generation. Moreover, prefix-projection substantially reduces the size of projected databases and leads to efficient processing. Our performance study shows that PrefixSpan outperforms both the Apriori-based GSP algorithm and another recently proposed method, FreeSpan, in mining large sequence databases.
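The prefix-projection idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes every element holds exactly one item, so a sequence is written as a plain string, and the names `prefixspan`, `grow`, and `DATABASE` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def prefixspan(database: List[str], min_support: int) -> Dict[str, int]:
    """Return every frequent sequential pattern with its support.

    Simplifying assumption: each element is a single item, so a
    sequence is a string and a pattern is a string of items.
    """
    patterns: Dict[str, int] = {}

    def grow(prefix: str, projected: List[str]) -> None:
        # Count each item at most once per projected sequence.
        counts = Counter(item for suffix in projected for item in set(suffix))
        for item, count in sorted(counts.items()):
            if count < min_support:
                continue
            new_prefix = prefix + item
            patterns[new_prefix] = count
            # Project: keep only what follows the first occurrence of item.
            new_projected = [
                suffix[suffix.index(item) + 1:]
                for suffix in projected
                if item in suffix
            ]
            grow(new_prefix, new_projected)

    grow("", database)
    return patterns


# Toy database, invented for illustration only.
DATABASE = ["abcd", "acd", "abd", "cb"]

if __name__ == "__main__":
    for pattern, count in sorted(prefixspan(DATABASE, 2).items()):
        print(pattern, count)
```

Each recursive call scans only the suffixes that follow the current prefix, and no candidate is counted unless it already occurs in the projected database.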

1 Introduction

Sequential pattern mining, which discovers frequent subsequences as patterns in a sequence database, is an important data mining problem with broad applications, including the analyses of customer purchase behavior, Web access patterns, scientific experiments, disease treatments, natural disasters, DNA sequences, and so on.



The work was supported in part by the Natural Sciences and Engineering Research Council of Canada (grant NSERC-A3723), the Networks of Centres of Excellence of Canada (grant NCE/IRIS-3), and the Hewlett-Packard Lab, U.S.A.

The sequential pattern mining problem was first introduced by Agrawal and Srikant in [2]: Given a set of sequences, where each sequence consists of a list of elements and each element consists of a set of items, and given a user-specified min_support threshold, sequential pattern mining is to find all of the frequent subsequences, i.e., the subsequences whose occurrence frequency in the set of sequences is no less than min_support.
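The problem statement above can be made concrete with a small sketch. It models an element as a set of items and a sequence as a list of elements, and it counts support as the number of database sequences that contain a pattern. The helper names (`is_subsequence`, `support`, `seq`) and the toy database are assumptions made for illustration.

```python
from typing import FrozenSet, List

Element = FrozenSet[str]   # an element is a set of items
Sequence = List[Element]   # a sequence is an ordered list of elements


def seq(*elements: str) -> Sequence:
    """Build a sequence from strings, e.g. seq("a", "bc") is <a (bc)>."""
    return [frozenset(element) for element in elements]


def is_subsequence(pattern: Sequence, sequence: Sequence) -> bool:
    """True if the pattern's elements are subsets of distinct elements
    of the sequence, in the same order (greedy earliest match)."""
    position = 0
    for element in pattern:
        while position < len(sequence) and not element <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next pattern element must match strictly later
    return True


def support(pattern: Sequence, database: List[Sequence]) -> int:
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for sequence in database if is_subsequence(pattern, sequence))


# Toy sequence database, invented for illustration only.
DATABASE = [
    seq("a", "abc", "ac", "d", "cf"),
    seq("ad", "c", "bc", "ae"),
    seq("ef", "ab", "df", "c", "b"),
    seq("e", "g", "af", "c", "b", "c"),
]

if __name__ == "__main__":
    min_support = 2
    for pattern in (seq("a", "bc"), seq("a", "b"), seq("ab")):
        count = support(pattern, DATABASE)
        print(sorted(map(sorted, pattern)), count, count >= min_support)
```

With min_support set to 2, all three patterns tried in the usage block are frequent in this toy database: <a (bc)> and <(ab)> occur in two sequences each, and <a b> occurs in all four.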

Many studies have contributed to the efficient mining of sequential patterns or other frequent patterns in time-related data, e.g., [2, 11, 9, 10, 3, 8, 5, 4]. Almost all of the previously proposed methods for mining sequential patterns and other time-related frequent patterns are Apriori-like, i.e., based on the Apriori property proposed in association mining [1], which states the fact that any super-pattern of a nonfrequent pattern cannot be frequent.
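A short sketch can show how this property is used for pruning. It assumes single-item elements (sequences as strings) and a toy database. The function names are hypothetical, and the check shown is one simple way to apply the property rather than the exact procedure of any cited method.

```python
from typing import List, Set


def contains(sequence: str, pattern: str) -> bool:
    """True if pattern is a subsequence of sequence (order preserved)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def support(pattern: str, database: List[str]) -> int:
    """Number of sequences that contain the pattern."""
    return sum(contains(sequence, pattern) for sequence in database)


def has_infrequent_subpattern(candidate: str, frequent: Set[str]) -> bool:
    """Apriori-style check: drop one item at a time; if any resulting
    sub-pattern is not frequent, the candidate cannot be frequent."""
    return any(
        candidate[:i] + candidate[i + 1:] not in frequent
        for i in range(len(candidate))
    )


# Toy data, invented for illustration only.
DATABASE = ["abcd", "acd", "abd", "cb"]
MIN_SUPPORT = 2
ITEMS = "abcd"
FREQUENT_2 = {
    x + y
    for x in ITEMS
    for y in ITEMS
    if support(x + y, DATABASE) >= MIN_SUPPORT
}

if __name__ == "__main__":
    for candidate in ("abc", "abd"):
        pruned = has_infrequent_subpattern(candidate, FREQUENT_2)
        print(candidate, "pruned" if pruned else "kept", support(candidate, DATABASE))
```

Here "abc" is pruned without a database scan because its sub-pattern "bc" occurs in only one sequence, while "abd" survives the check and must still be counted.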

Based on this heuristic, a typical Apriori-like method such as GSP [11] adopts a multiple-pass, candidate-generation-and-test approach in sequential pattern mining. This is outlined as follows. The first scan finds all of the frequent items which form the set of single item frequent sequences. Each subsequent pass starts with a seed set of sequential patterns, which is the set of sequential patterns found in the previous pass. This seed set is used to generate new potential patterns, called candidate sequences. Each candidate sequence contains one more item than a seed sequential pattern, where each element in the pattern may contain one or multiple items. The number of items in a sequence is called the length of the sequence. So, all the candidate sequences in a pass will have the same length. The scan of the database in one pass finds the support for each candidate sequence. All of the candidates whose support in the database is no less than min_support form the set of the newly found sequential patterns. This set then becomes the seed set for the next pass. The algorithm terminates when no new sequential pattern is found in a pass, or no candidate sequence can be generated.
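The multiple-pass outline above can be sketched as a level-wise loop. This is a simplified illustration and not the GSP algorithm itself: it assumes single-item elements (sequences as strings) and min_support of at least 1, it uses a plain join of seed patterns to generate candidates, and all names are hypothetical.

```python
from typing import Dict, List


def contains(sequence: str, pattern: str) -> bool:
    """True if pattern is a subsequence of sequence (order preserved)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def count_frequent(candidates: List[str], database: List[str],
                   min_support: int) -> Dict[str, int]:
    """One pass over the database: keep candidates meeting min_support."""
    frequent: Dict[str, int] = {}
    for candidate in candidates:
        count = sum(contains(sequence, candidate) for sequence in database)
        if count >= min_support:
            frequent[candidate] = count
    return frequent


def level_wise_mine(database: List[str], min_support: int) -> Dict[str, int]:
    """Candidate-generation-and-test, one pattern length per pass.

    Assumes min_support >= 1 so that the loop terminates.
    """
    # First scan: frequent single items form the initial seed set.
    items = sorted({item for sequence in database for item in sequence})
    seeds = count_frequent(items, database, min_support)
    patterns = dict(seeds)
    while seeds:
        # Join seeds p and q when p without its first item equals
        # q without its last item; the candidate is one item longer.
        candidates = sorted(
            {p + q[-1] for p in seeds for q in seeds if p[1:] == q[:-1]}
        )
        # Scan the database to find the support of every candidate.
        seeds = count_frequent(candidates, database, min_support)
        patterns.update(seeds)
    return patterns


# Toy database, invented for illustration only.
DATABASE = ["abcd", "acd", "abd", "cb"]

if __name__ == "__main__":
    for pattern, count in sorted(level_wise_mine(DATABASE, 2).items()):
        print(pattern, count)
```

On this toy database the second pass counts 16 length-2 candidates to find 5 frequent ones, which gives a small-scale picture of the candidate overhead that the multiple-pass approach incurs.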

Similar to the analysis of Apriori frequent pattern min-
