An Efficient Regular Expression Learning Algorithm Based on Repeated Strings

Research paper


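This paper presents a regular expression learning algorithm based on repeated string detection, with applications in fields such as information extraction, XML schema learning, and biological sequence analysis. A regular expression describes rules for combining characters or character sets, which makes it a powerful tool for matching and processing text. Inferring a complex regular expression from a finite set of samples, however, is a challenging task.

The core idea of the algorithm is to use repeated strings to build and learn regular patterns. It targets a subclass of regular expressions in which the Kleene star (*) and Kleene plus (+) operators can apply to multiple characters. Kleene star matches zero or more repetitions of a character sequence and Kleene plus matches one or more, which is essential to the flexibility and descriptive power of the learned patterns. The authors aim for efficiency as well as accuracy, which matters for real-time processing and parsing of text at large scale. The design may involve techniques such as string hashing or sliding windows to identify repeated patterns quickly. Through iteration and refinement, the algorithm gradually builds a regular expression that captures the repeated patterns in the input samples. Preliminary experimental results indicate that the algorithm is both effective and efficient, including on large inputs.

The paper belongs to the theory of computation, specifically formal languages and automata theory. Formal language theory studies how symbol systems represent information, and automata are abstract models for recognizing such languages. By modeling complex patterns through repeated string detection, the work offers a new perspective on learning regular expressions efficiently and may support further research and practical applications in this area.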


An Algorithm for Learning Regular Expressions Based on Repeated String Detection

Gang Lin

College of Computer Science and

Technology, Hua Qiao University

Xiamen, China

18950149979, 361021

a2888100@gmail.com

Lixiao Zheng

College of Computer Science and

Technology, Hua Qiao University

Xiamen, China

15980844031, 361021

zhenglx@hqu.edu.cn

Yuanyang Wang

College of Computer Science and

Technology, Hua Qiao University

Xiamen, China

17746044405, 361021

shylunule@gmail.com

ABSTRACT

The inference of regular expressions from a finite number of samples has important applications in various fields such as information extraction, XML schema learning, and biological sequence analysis. In this paper, we present an algorithm for learning regular expressions based on repeated string detection. The algorithm can learn a subclass of regular expressions in which unary operators such as Kleene star and Kleene plus can apply to multiple characters. Preliminary experimental results demonstrate the effectiveness and efficiency of the proposed algorithm.

CCS Concepts

Theory of computation ➝ Formal languages and automata theory ➝ Regular languages

Keywords

Regular expression; learning algorithm; repeated string detection.

1. INTRODUCTION

The inference of regular expressions from a finite number of samples has important applications in various fields such as information extraction [1], XML schema learning [2], and biological sequence analysis [3]. One straightforward way of learning regular expressions is based on finite automata: we first derive an automaton from the input samples and then use any classical algorithm to convert the automaton into a regular expression. The disadvantage of basing regular expression (RE) learning on the inference of automata is that converting the result into an RE by standard algorithms leads to a length explosion, which was empirically validated by Blackwell [4]. A better way is therefore to derive regular expressions directly. In this paper, we propose a learning algorithm that, instead of utilizing automata, takes regular expressions as the direct target.

According to Gold’s theory of learning in the limit, the whole class of regular expressions cannot be learned from positive data alone [5]. Therefore, some scholars have defined subclasses of regular expressions by restricting their form and have provided algorithms that learn those subclasses. G. J. Bex required that each character appear at most once in the expression, called a single occurrence regular expression, and gave a learning algorithm [6]. The core of the algorithm is to first construct a single occurrence automaton from the input sample and then derive a single occurrence regular expression from this automaton by a series of rewrite rules. The same work [6] also examines a learning algorithm for another special subclass, chain regular expressions. Recently, D. Freydenberger proved that the learning algorithms in [6] have an over-generalization problem, that is, the language defined by the learned expression is much larger than the input sample. In light of this, Freydenberger gave a new linear-time algorithm for learning single occurrence and chain regular expressions and proved that its result is the smallest among all single occurrence or chain regular expressions whose language contains the input samples [7], which satisfies the notion of descriptive generalization described in [8]. The authors of [9] used a probabilistic model to generate a k-occurrence automaton from the sample set and then rewrote it into a k-occurrence regular expression, in which each character can appear at most k times.
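The occurrence restriction described above can be illustrated with a small check. The following is only a sketch of the definition, not any of the cited learning algorithms; the function names are hypothetical, and it assumes that alphabet symbols are single alphanumeric characters while every other character in the expression is an operator or parenthesis.

```python
from collections import Counter


def max_symbol_occurrence(expr: str) -> int:
    """Return the largest number of times any alphabet symbol occurs in expr.

    Assumption: alphabet symbols are single alphanumeric characters;
    everything else (parentheses, |, *, +, ?) is treated as an operator.
    """
    counts = Counter(ch for ch in expr if ch.isalnum())
    return max(counts.values(), default=0)


def is_single_occurrence(expr: str) -> bool:
    """True if every alphabet symbol occurs at most once in expr."""
    return max_symbol_occurrence(expr) <= 1


print(is_single_occurrence("a(b|c)*d"))  # True: a, b, c, d each occur once
print(max_symbol_occurrence("a(ba)+c"))  # 2: the symbol a occurs twice
```

Under this reading, an expression is a k-occurrence regular expression exactly when `max_symbol_occurrence` returns a value no larger than k.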

All of the above work restricts the number of occurrences of each character in the expression, either to one or to at most k. H. Fernau presented a regular expression learning algorithm [10] that does not impose such restrictions. The algorithm is based on block left alignment. It first splits each input string into blocks such that repeated letters fall in the same block. It then aligns the blocks from left to right. After generalization, the algorithm outputs the corresponding expression. However, this algorithm has its own shortcomings. One disadvantage is that, in the learned regular expressions, a Kleene closure operator (* or +) acts only on a single letter. For example, when the input is the string “ababab”, the algorithm outputs the string itself, ignoring the fact that it can be denoted more simply as (ab)+. To overcome this disadvantage, we propose in this paper a learning algorithm based on repeated string detection that can learn from multiple positive samples and output regular expressions in which a Kleene closure operator (* or +) can act on multiple letters.
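The contrast between closure over single letters and closure over repeated strings can be sketched on one input string. The code below is an illustration only: it is neither Fernau's block left alignment nor the algorithm proposed in this paper, whose details are given in Section 3. Both function names are hypothetical, the input is assumed to contain only alphanumeric characters, and the greedy shortest-unit-first strategy is an assumption made for brevity.

```python
import re
from itertools import groupby


def collapse_single_letters(s: str) -> str:
    """Generalize runs of one repeated letter only, e.g. 'aab' -> 'a+b'."""
    parts = []
    for letter, run in groupby(s):
        n = len(list(run))
        parts.append(letter + "+" if n > 1 else letter)
    return "".join(parts)


def collapse_repeated_strings(s: str) -> str:
    """Greedily replace consecutive repeats of a substring w by (w)+.

    At each position the shortest candidate unit is tried first; a unit
    that occurs at least twice in a row is collapsed into one closure.
    """
    out = []
    i, n = 0, len(s)
    while i < n:
        for unit_len in range(1, (n - i) // 2 + 1):
            unit = s[i:i + unit_len]
            reps = 1
            # Count how many copies of unit follow one another from i.
            while s.startswith(unit, i + reps * unit_len):
                reps += 1
            if reps >= 2:
                out.append(unit + "+" if unit_len == 1 else "(" + unit + ")+")
                i += reps * unit_len
                break
        else:
            # No repeated unit starts here; keep the letter as it is.
            out.append(s[i])
            i += 1
    return "".join(out)


print(collapse_single_letters("ababab"))    # ababab
print(collapse_repeated_strings("ababab"))  # (ab)+
print(collapse_repeated_strings("xababy"))  # x(ab)+y

# The generalized expression still accepts the sample it was built from.
assert re.fullmatch(collapse_repeated_strings("ababab"), "ababab")
```

This sketch handles a single sample and never introduces alternation or Kleene star, so it should be read as a picture of the target expression class and not as a learning procedure for multiple positive samples.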

Organization of the paper. In the next section, we define the terminology and notation used in the paper. In Section 3, we describe our learning algorithm. In Section 4, we experimentally evaluate the performance of our algorithm. Finally, we conclude the paper and outline future directions in Section 5.

2. PRELIMINARIES

In this section we introduce the concepts and notations that we use throughout the paper.

Permission to make digital or hard copies of all or part of this work for

personal or classroom use is granted without fee provided that copies are

not made or distributed for profit or commercial advantage and that

copies bear this notice and the full citation on the first page. To copy

otherwise, or republish, to post on servers or to redistribute to lists,

requires prior specific permission and/or a fee.

CSAI '18, December 8–10, 2018, Shenzhen, China

ACM ISBN 978-1-4503-6606-9/18/12…$15.00

DOI: https://doi.org/10.1145/3297156.3297262
