RESEARCH ARTICLE
Copyright © 2014 American Scientific Publishers
All rights reserved
Printed in the United States of America
Journal of
Computational and Theoretical Nanoscience
Vol. 11, 1–6, 2014
Compacted Implementations of
Deterministic Finite Automata
Meng Zhang
1 ∗
, Yi Zhang
2
, Wei Lv
1
, and Chen Hou
1
1
College of Computer Science and Technology, Jilin University, Changchun, China
2
Department of Computer Science, Jilin Business and Technology College, Changchun, China
Automaton is a popular data structure with many applications. We present implementation methods
of deterministic finite automaton that allow both efficient space usage and performance. The first
method utilizes the succinct data structures to speedup the computation of the transition function.
The method archives O(log log ) time state transition with little augmented memory where is the
size of the alphabet. The second method, namely overlapping table, reduces the space require-
ments of automata implemented by the state transition table approach. It makes the transition tables
of states share the overlapping address space thus reduces the space usage. The method has an
O(1) time state transition, while using fewer memory.
Keywords: Automaton, Compression, Succinct Data Structure.
1. INTRODUCTION
The deterministic finite automaton (DFA) is one of the
most popular data structures.
1 2
It is used in many fields
of computer science, such as data compression, pattern
matching, regular expression matching and text indexing.
Several DFAs used in these applications have been pro-
posed, such as Aho-Corasick automata,
3
automata for rec-
ognizing regular expressions,
4
suffix automata
5 6
and the
factor oracle.
7
The AC automaton solves the multiple pat-
tern matching problem in Onlog time ( is the size of
the alphabet, n is the length of the input text) which gen-
eralizes the Knuth–Morris–Pratt algorithm.
8
The AC algo-
rithm is extensively used in practice, such as deep packet
inspection. The suffix automaton presented by Blumer
et al.
5
is the minimal deterministic automaton accepting
the set of suffixes of a text.
6
Suffix automaton is one of
the core data structures of several efficient string match-
ing algorithms.
9
The Factor Oracle
7
is derived from the
DAWG. It can recognize more than the exact factors of a
string to achieve simplicity and low memory requirements.
The factor oracle is also applied in pattern matching, data
compression and machine learning.
The high space usage of automata limits their appli-
cability. In this paper, we focus on the implementation
of automata that allows both efficient storage and usage.
We don’t study the techniques for compacting specific
automata, such as reducing the number of states or edges
∗
Author to whom correspondence should be addressed.
or using the NFA (non-determined finite automaton) to
simulating the DFA. The problem that we consider is
to represent the general DFA space economically while
not slowing down the performance. Our techniques are
general-purpose that can be used in representing any type
of DFAs. We will use automata to refer to DFAs in
the rest of the paper. We give two representing meth-
ods. The first one utilizes the succinct data structures
10
to speedup the computation of the transition function,
archiving O(log log ) running time. The second one,
namely overlapping table, reduces the space requirements
of automata implemented by the state transition table
approach. In the case of constant alphabets, each state
in a DFA has its own -entry transition table. There are
empty entries in the transition tables. Our approach makes
the transition tables overlap, that is, tables share some
space such that an empty entry of one table is used by
a non-empty entry of another table. To solve the prob-
lem of determining the ownership of table entries, we
introduce a method that uses log -bit labels to identify
each table. The total memory of the tables is reduced
by reusing empty entries. The problem of generating the
minimal overlapping table is an optimization problem.
This problem can be reduced to the shortest common
superstring problem (SCS for short). Given a substring-
free set of strings P, the SCS asks for a shortest com-
mon superstring of P , that is, a minimum-length string
containing all strings from P as substrings. In computa-
tional biology,
11–13
the SCS is used for the DNA frag-
ment assembly problem.
11
The SCS is NP hard
14
and even
J. Comput. Theor. Nanosci. 2014, Vol. 11, No. 3 1546-1955/2014/11/001/006 doi:10.1166/jctn.2014.3443 1