PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern
Growth
Jian Pei Jiawei Han Behzad Mortazavi-Asl Helen Pinto
Intelligent Database Systems Research Lab. School of Computing Science, Simon Fraser University
Burnaby, B.C., Canada V5A 1S6 E-mail:
peijian, han, mortazav, hlpinto
@cs.sfu.ca
Qiming Chen Umeshwar Dayal Mei-Chun Hsu
Hewlett-Packard Labs. Palo Alto, California 94303-0969 U.S.A.
E-mail:
qchen, dayal, mchsu
@hpl.hp.com
Abstract
Sequential pattern mining is an important data min-
ing problem with broad applications. It is challenging
since one may need to examine a combinatorially explo-
sive number of possible subsequence patterns. Most of the
previously developed sequential pattern mining methods
follow the methodologyof
which may substantially
reduce the number of combinations to be examined. How-
ever,
still encounters problems when a sequence
database is large and/or when sequential patterns to be
mined are numerous and/or long.
In this paper, we propose a novel sequential pattern
mining method, called PrefixSpan (i.e., Prefix
-projected
Sequential pattern mining), which explores prefix-
projection in sequential pattern mining. PrefixSpan
mines the complete set of patterns but greatly reduces the
efforts of candidate subsequence generation. Moreover,
prefix-projection substantiallyreduces the size of projected
databases and leads to efficient processing. Our per-
formance study shows that PrefixSpan outperforms both
the
-based GSP algorithm and another recently
proposed method, FreeSpan, in mining large sequence
databases.
1 Introduction
Sequential pattern mining, which discovers frequent
subsequences as patterns in a sequence database, is an im-
portant data mining problem with broad applications, in-
cluding the analyses of customer purchase behavior, Web
access patterns, scientific experiments, disease treatments,
natural disasters, DNA sequences, and so on.
The work was supported in part by the Natural Sciences and En-
gineering Research Council of Canada (grant NSERC-A3723), the Net-
works of Centres of Excellence of Canada (grant NCE/IRIS-3), and the
Hewlett-Packard Lab, U.S.A.
The sequential pattern mining problem was first intro-
duced by Agrawal and Srikant in [2]: Given a set of se-
quences, where each sequence consists of a list of elements
and each element consists of a set of items, and given
a user-specified min
support threshold, sequential pattern
mining is to find all of the frequent subsequences, i.e., the
subsequences whose occurrence frequency in the set of se-
quences is no less than min support.
Many studies have contributed to the efficient mining
of sequential patterns or other frequent patterns in time-
related data, e.g., [2, 11, 9, 10, 3, 8, 5, 4]. Almost all
of the previously proposed methods for mining sequen-
tial patterns and other time-related frequent patterns are
-like, i.e., based on the
property proposed
in association mining [1], which states the fact that any
super-pattern of a nonfrequent pattern cannot be frequent.
Based on this heuristic, a typical
-like method
such as GSP [11] adopts a multiple-pass, candidate-
generation-and-test approach in sequential pattern mining.
This is outlined as follows. The first scan finds all of the
frequent items which form the set of single item frequent
sequences. Each subsequent pass starts with a seed set of
sequential patterns, which is the set of sequential patterns
found in the previous pass. This seed set is used to gen-
erate new potential patterns, called candidate sequences.
Each candidate sequence contains one more item than a
seed sequential pattern, where each element in the pattern
may containone or multiple items. The number of items in
a sequence is called the length of the sequence. So, all the
candidate sequences in a pass will have the same length.
The scan of the database in one pass finds the support for
each candidate sequence. All of the candidates whose sup-
port in the database is no less than min support form the
set of the newly found sequential patterns. This set then
becomes the seed set for the next pass. The algorithm ter-
minates when no new sequential pattern is found in a pass,
or no candidate sequence can be generated.
Similar to the analysis of
frequent pattern min-