308 Int. J. Wireless and Mobile Computing, Vol. 5, No. 3, 2012
Copyright © 2012 Inderscience Enterprises Ltd.
Space-efficient multiple string matching automata
Meng Zhang*, Tianyu Yang and Rui Wu
College of Computer Science and Technology,
Jilin University,
Changchun, China
Email: zhangmeng@jlu.edu.cn
Email: tricky1997@qq.com
Email: 1245023548@qq.com
*Corresponding author
Abstract: Aho–Corasick (AC) automaton is a data structure for multiple string matching. We
present two compressing methods that enable the AC automaton to work on systems with limited
resource such as mobile devices. By the first method, the AC automaton for a pattern set P over
an alphabet of size
σ
needs (
σ
+ 1)I + (1 + log|P| + logM)M + o(M) bits where M and I are the
number of states and the number of non-leaf states of the AC automaton respectively, and a state
transition takes O(1) time. By the second method, the space is I + (1 + log|P| + logM + log
σ
)M +
o(M log
σ
) bits, and a state transition takes O(log log
σ
) time. We then combine the two methods
together and archive trade-offs between the space and time complexity.
Keywords: multiple string matching; Aho–Corasick automaton; succinct data structure.
Reference to this paper should be made as follows: Zhang, M., Yang, T. and Wu, R. (2012)
‘Space-efficient multiple string matching automata’, Int. J. Wireless and Mobile Computing,
Vol. 5, No. 3, pp.308–313.
Biographical notes: Meng Zhang is currently an Associate Professor in College of Computer Science
and Technology, Jilin University. He received his PhD in Computer Science from Jilin University in
2003. His main research interests include stringology, network security and computational biology.
Tianyu Yang is currently a third year undergraduate student in College of Computer Science and
Technology at Jilin University. His research interests include algorithm design and network security.
Rui Wu is currently a third year undergraduate student in College of Computer Science and
Technology at Jilin University. His research interests include algorithm design and network security.
1 Introduction
The
problem of multiple string matching is to find a set of strings
P = {p
1
, …, p
k
} in a text t. Throughout the paper, k denotes the
number of patterns, n denotes the length of t, Σ = {a
1
, a
2
, …,
a
σ
} denotes the alphabet of
σ
symbols from which the symbols
in P and t are chosen. Much research focus on multiple string
matching problem. The first algorithm to solve this problem in
O(n log
σ
) time is presented by Aho and Corasick (1975) and
Cormen et al. (2001) which generalises the Knuth–Morris–Pratt
algorithm (Knuth et al., 1977). The Commentz-Walter (1979)
algorithm is a direct extension of the Boyer–Moore algorithm
(Boyer and Moore, 1977) which also combines the idea of AC
algorithm. There are also several factors recognition approach-
based algorithms (Raffinot, 1997; Allauzen and Raffinot, 1999;
Crochemore et al., 1999) that use either suffix automata or
factor oracles for precise or weak factor recognition. For short
patterns, bit parallelism leads to algorithms that are efficient in
practice (see Navarro and Raffinot, 2002). Though presented
30 years ago, AC algorithm is still extensively used in practice.
The AC-automaton is also the core data structure in deep
packet inspection systems (Song et al., 2008; Snort Intrusion
Detection System, 2012, see http://www.snort.org) and anti-
virus systems (ClamAV, 2012; http://www.clamav.net).
Succinct data structures (Jacobson, 1989; Munro, 1996;
Munro and Raman, 1997; Ferragina and Manzini, 2000; Grossi
and Vitter, 2000; He et al., 2005) recently have received much
attention. A succinct data structure stores objects using
space close to the information-theoretic lower bound, while
simultaneously supporting a number of primitive operations on
the objects in constant time. The compressed suffix array
structure proposed by Grossi and Vitter (2000) is the first
method that reduces the size of the suffix array from O(n log n)
bits to O(n) bits and supports access to any entry of the original
suffix array in
)
logOn
σ
∈
time, for any fixed constant 0 < ∈
< 1 (without computing the entire original suffix array).
FM-index proposed by Ferragina and Manzini (2000) is a self-
index data structure with good compression ratio and fast
decompressing speed. The FM-index occupies at most 5nH
k
(T)
+ o(n) bits of storage, where H
k
is the k-th order entropy of T
and allows the search for the occ occurrences of a pattern
p[1…m] within T in O(m + occ log
1+∈
n) time.
In this paper, we focus on compacting AC automata to
make it possible to work on systems with limited resource. The
major drawback that limits the applicability of AC automata is
their space complexity. We present space compacting methods
for AC automata based on succinct data structure technologies