Compact Implementations of Deterministic Finite Automata: Space Efficiency and Performance Optimization


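This paper presents two compact implementation methods for the deterministic finite automaton (DFA), a basic construct of theoretical computer science that is widely used for tasks such as string pattern matching and language recognition. The focus is on improving both space efficiency and performance.

The first method uses succinct data structures to optimize the DFA implementation. These structures store data in compressed form and therefore reduce the storage requirement. For an alphabet of size σ, the method performs a state transition in O(log log σ) time while using only a small amount of additional memory, which is helpful when the alphabet or the input is large.

The second method is called the overlapping table. It improves on the traditional state transition table implementation: the transition tables of different states are laid out so that they share part of the same address space, which reduces the total space. The saving is largest for automata whose transition tables contain many empty entries that other tables can reuse.

Together, the two methods offer DFA implementation strategies that balance space and speed. They are relevant wherever finite automata are applied in practice, for example in text processing, compilers, and cryptography, and they suggest further directions for research on data structures and algorithms.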
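To make the phrase "succinct data structures" concrete, here is a minimal Python sketch of the basic primitive such structures rely on: a bitmap with a rank query. It is not the O(log log σ) structure of the paper, whose details are not given in this excerpt. The class name `RankBitmapDFA`, the row format (a dictionary from symbol code to target state), and the use of Python integers as bitmaps are all assumptions made for illustration only.

```python
from typing import Dict, List, Optional


class RankBitmapDFA:
    """Sparse DFA rows stored as one bitmap per state plus a packed target array.

    Illustrative only: a bitmap-with-rank layout, not the paper's data structure.
    """

    def __init__(self, rows: List[Dict[int, int]], sigma: int) -> None:
        self.sigma = sigma
        self.bitmaps: List[int] = []  # bit c is set if the state has a transition on c
        self.starts: List[int] = []   # index of the state's first target in self.targets
        self.targets: List[int] = []  # targets of all states, packed without gaps
        for row in rows:
            bits = 0
            self.starts.append(len(self.targets))
            for c in sorted(row):     # targets are stored in increasing symbol order
                bits |= 1 << c
                self.targets.append(row[c])
            self.bitmaps.append(bits)

    def step(self, state: int, symbol: int) -> Optional[int]:
        """Return the target state, or None if no transition is defined."""
        bits = self.bitmaps[state]
        if not (bits >> symbol) & 1:
            return None
        # rank: number of defined symbols smaller than `symbol`
        rank = bin(bits & ((1 << symbol) - 1)).count("1")
        return self.targets[self.starts[state] + rank]


# Hypothetical 4-state automaton over symbol codes 0..3.
rows = [{0: 1, 2: 2}, {1: 2}, {3: 0}, {}]
dfa = RankBitmapDFA(rows, sigma=4)
print(dfa.step(0, 2), dfa.step(0, 1))
```

Only defined transitions occupy a slot in `targets`; the per-state bitmap costs σ bits. The rank here is computed by counting bits, so this sketch says nothing about the time bound claimed in the paper.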

RESEARCH ARTICLE

Journal of Computational and Theoretical Nanoscience
Vol. 11, 1–6, 2014
Printed in the United States of America

Compacted Implementations of Deterministic Finite Automata

Meng Zhang (1, ∗), Yi Zhang, Wei Lv, and Chen Hou

College of Computer Science and Technology, Jilin University, Changchun, China
Department of Computer Science, Jilin Business and Technology College, Changchun, China

The automaton is a popular data structure with many applications. We present implementation methods for deterministic finite automata that allow both efficient space usage and good performance. The first method utilizes succinct data structures to speed up the computation of the transition function. The method achieves O(log log σ) time state transition with little additional memory, where σ is the size of the alphabet. The second method, namely the overlapping table, reduces the space requirements of automata implemented by the state transition table approach. It makes the transition tables of states share overlapping address space and thus reduces the space usage. The method has an O(1) time state transition while using less memory.

Keywords: Automaton, Compression, Succinct Data Structure.

1. INTRODUCTION

The deterministic finite automaton (DFA) is one of the most popular data structures [1, 2]. It is used in many fields of computer science, such as data compression, pattern matching, regular expression matching, and text indexing. Several DFAs have been proposed for these applications, such as Aho-Corasick (AC) automata, automata for recognizing regular expressions, suffix automata [5, 6], and the factor oracle. The AC automaton, which generalizes the Knuth–Morris–Pratt algorithm, solves the multiple pattern matching problem in O(n log σ) time, where σ is the size of the alphabet and n is the length of the input text. The AC algorithm is used extensively in practice, for example in deep packet inspection. The suffix automaton presented by Blumer et al. is the minimal deterministic automaton accepting the set of suffixes of a text. The suffix automaton is one of the core data structures of several efficient string matching algorithms. The factor oracle is derived from the DAWG (directed acyclic word graph). It accepts more than the exact factors of a string, which buys simplicity and low memory requirements. The factor oracle is also applied in pattern matching, data compression, and machine learning.

The high space usage of automata limits their applicability. In this paper, we focus on implementations of automata that allow both efficient storage and efficient use. We do not study techniques for compacting specific automata, such as reducing the number of states or edges or using an NFA (nondeterministic finite automaton) to simulate the DFA. The problem we consider is to represent a general DFA space-economically without slowing down its operation. Our techniques are general-purpose and can be used to represent any type of DFA. We will use "automata" to refer to DFAs in the rest of the paper.

We give two representation methods. The first one utilizes succinct data structures to speed up the computation of the transition function, achieving O(log log σ) running time. The second one, namely the overlapping table, reduces the space requirements of automata implemented by the state transition table approach. In the case of constant alphabets, each state in a DFA has its own σ-entry transition table, and these tables contain empty entries. Our approach makes the transition tables overlap, that is, tables share space such that an empty entry of one table is used by a non-empty entry of another table. To determine which table owns a given entry, we introduce a method that uses log σ-bit labels to identify each table. The total memory of the tables is reduced by reusing empty entries. Generating the minimal overlapping table is an optimization problem. This problem can be reduced to the shortest common superstring problem (SCS for short). Given a substring-free set of strings P, the SCS asks for a shortest common superstring of P, that is, a minimum-length string containing all strings from P as substrings. In computational biology [11–13], the SCS is used for the DNA fragment assembly problem. The SCS is NP-hard and even

∗ Author to whom correspondence should be addressed.

J. Comput. Theor. Nanosci. 2014, Vol. 11, No. 3; 1546-1955/2014/11/001/006; doi:10.1166/jctn.2014.3443
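The overlapping-table idea described in the introduction (tables share one address space, and a short label tells which table owns an entry) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' construction: the class name `OverlappingTable`, the greedy first-fit placement, and the labeling rule (each table gets a distinct base offset, and its label is that offset modulo σ, which fits in ⌈log₂ σ⌉ bits) are choices made for this sketch. First-fit does not produce a minimal table; as the text notes, finding the minimal layout is an optimization problem.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[int, int]  # symbol code (0..sigma-1) -> target state


class OverlappingTable:
    """Transition tables of all states overlaid in one shared slot array.

    Illustrative sketch: greedy first-fit placement, label = base mod sigma.
    """

    def __init__(self, rows: List[Row], sigma: int) -> None:
        self.sigma = sigma
        self.base: List[int] = [0] * len(rows)            # offset of each state's table
        self.slots: List[Optional[Tuple[int, int]]] = []  # (owner label, target) or None
        used_bases = set()
        # Place denser tables first; they are the hardest to fit.
        for s in sorted(range(len(rows)), key=lambda i: -len(rows[i])):
            b = 0
            # Bases must be distinct so that labels identify owners (see note below).
            while b in used_bases or not self._fits(rows[s], b):
                b += 1
            used_bases.add(b)
            self.base[s] = b
            # Make every index base + symbol valid for lookups.
            if b + sigma > len(self.slots):
                self.slots.extend([None] * (b + sigma - len(self.slots)))
            for c, target in rows[s].items():
                self.slots[b + c] = (b % sigma, target)

    def _fits(self, row: Row, b: int) -> bool:
        """True if every non-empty entry of `row` lands on a free slot at base b."""
        return all(b + c >= len(self.slots) or self.slots[b + c] is None for c in row)

    def step(self, state: int, symbol: int) -> Optional[int]:
        """One array access plus one label comparison."""
        b = self.base[state]
        entry = self.slots[b + symbol]
        if entry is not None and entry[0] == b % self.sigma:
            return entry[1]
        return None  # slot is empty or owned by another state's table


# Hypothetical 4-state automaton over symbol codes 0..3.
rows = [{0: 1, 2: 2}, {1: 2}, {3: 0}, {}]
table = OverlappingTable(rows, sigma=4)
print(len(table.slots), "slots instead of", 4 * len(rows))
```

Why the label works in this sketch: any slot p can only be reached from tables whose base lies in the window p − σ + 1, …, p. These are σ consecutive integers, so distinct bases in that window have distinct residues modulo σ, and a slot owned by one table is never mistaken for an entry of another.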
