没有合适的资源?快使用搜索试试~ 我知道了~
首页基于重复字符串的高效正则表达式学习算法
本文主要探讨了一种基于重复字符串检测的正则表达式学习算法,它在信息提取、XML模式学习以及生物序列分析等众多领域具有重要应用价值。正则表达式是一种强大的文本处理工具,通过描述字符或字符集的组合规则,能够高效地匹配和处理文本数据。然而,如何从有限的样本中推断出复杂的正则表达式是一项挑战性任务。 该算法的核心思想是利用重复字符串的特性来构建和学习正则模式。它特别关注一类特殊的正则表达式,允许Kleene星(*)和Kleene加(+)操作符作用于多个字符。这些操作符使得表达式可以匹配任意数量(包括零个)的重复字符序列,对于模式的灵活性和描述能力至关重要。作者强调,算法不仅追求准确性,还注重效率,这对于在大数据背景下实时处理和解析文本具有实际意义。 算法的设计过程可能涉及到字符串的哈希或者滑动窗口技术,以便快速识别重复模式。通过迭代和优化,它能逐步构建出能够捕获输入样本中重复模式的正则表达式。初步实验结果显示,该算法在实际应用中表现出良好的效果,不仅在正确性上满足需求,而且在处理大规模数据时展现出了较高的执行效率。 这篇研究论文属于计算理论范畴,具体来说,它涉及到了形式语言和自动机理论。形式语言理论研究的是用符号系统表示和理解信息的方式,而自动机则是用来模拟这些语言的抽象模型。作者们的研究成果有助于深化对正则表达式构造的理解,并可能推动该领域在实际问题中的进一步发展和应用。 该篇论文提出了一种新颖的正则表达式学习方法,通过重复字符串检测实现了对复杂模式的有效建模。其理论基础和实践效能为正则表达式的高效学习和应用提供了新的视角,对于推动相关领域的研究和技术进步具有积极意义。
资源详情
资源推荐
An Algorithm for Learning Regular Expressions Based on
Repeated String Detection
Gang Lin
College of Computer Science and
Technology, Hua Qiao University
Xiamen, China
18950149979, 361021
a2888100@gmail.com
Lixiao Zheng
College of Computer Science and
Technology, Hua Qiao University
Xiamen, China
15980844031, 361021
zhenglx@hqu.edu.cn
Yuanyang Wang
College of Computer Science and
Technology, Hua Qiao University
Xiamen, China
17746044405, 361021
shylunule@gmail.com
ABSTRACT
The inference of regular expressions from a finite number of
samples has important applications in various fields such as
information extraction, XML schema learning, biological
sequence analysis. In this paper, we present an algorithm for
learning regular expressions based on repeated string detection.
The algorithm can learn a subclass of regular expressions in which
unary operators such that Kleene star and Kleene plus can apply
on multiple characters. Preliminary experimental results
demonstrate the effectiveness and efficiency of the proposed
algorithm.
CCS Concepts
Theory of computation ➝ Formal languages and automata
theory ➝ Regular language
Keywords
Regular expression; learning algorithm; repeated string detection.
1. INTRODUCTION
The inference of regular expressions from a finite number of
samples has important applications in various fields such as
information extraction [1], XML schema learning [2], biological
sequence analysis [3] etc. One straightforward way of learning
regular expressions is based on finite automata. That is, we first
derive an automaton from the input samples, and then use any
classical algorithm to convert the automaton into a regular
expression. While, the disadvantage of basing RE learning on the
inference of automata is that when you finally turn your results
into REs by standard algorithms, there will be a problem of length
explosion, which was empirically validated by Blackwell in [4].
So, a better way is to derive regular expressions directly. In this
paper, we propose a learning algorithm which, instead of utilizing
automata, takes regular expressions as the direct target.
According to Gold’s learning in the limit theory, the whole class
of regular expressions cannot be learned from positive data [5].
Therefore, some scholars have defined subclasses of regular
expressions by restricting the form of regular expressions and
provided learning algorithms that can learn those subclasses.
Bex G J restricted that each character can only appear once in the
expression, called single occurrence regular expression and gives
the learning algorithm [6]. The core of the algorithm is first
constructing a single occurrence automaton from the input sample,
then derives single occurrence regular expression from this
automaton by a series of rewrite rules. This paper also examines
the learning algorithm of another special subclass: chain regular
expressions. Recently, Freydenberger D proved that the learning
algorithms in [6] have an over-generalization problem, that is, the
language defined by the recognized expression is much larger
than the input sample. In light of this, Freydenberger D gave a
new linear algorithm for learning single occurrence and chain
regular expressions, and proves that the learning result of the
algorithm is the smallest one of all single occurrence or chain
regular expressions containing the input samples [7], which
satisfies the descriptive generalizations as described by the
authors [8]. The authors of literature [9] used the probability
model to generate a k-occurrence automaton from the sample set,
then rewrites it to k-occurrence-regular-expression where each
character can appear at most k times.
All of the above work restrict that the appearance of characters
must be in a limited range, either once or at most k times. H.
Fernau presented a regular expression learning algorithm [10]
which does not impose such restrictions. The algorithm is based
on block left alignment. It firstly blocks the input string such that
the repeated letter will be in a same block. Secondly, the
algorithm aligns the blocks from left to right. After generalization,
the algorithm output the corresponding expression. However, this
algorithm has its own shortcomings. One of the disadvantages is
that in the learned regular expressions, Kleene closure operator (*
or +) acts only on a single letter. For example, when input is string
“ababab”, the algorithm will output the string itself, ignoring the
fact that it can be denoted as ab
3
for simplicity. To overcome this
disadvantage, in this paper, we propose a repeated-string-
detection based learning algorithm which can learn from multiple
positive samples and output regular expressions where Kleene
closure (* or +) can acts on multiple letters.
Organization of the paper. In the next section, we define
terminologies and notions used in the paper. In Section 3, we
describe our learning algorithm. In Section 4, we experimentally
evaluate the performance of our algorithm. Finally, we conclude
our paper and outline future directions in Section 5.
2. PRELIMINARIES
In this section we introduce the concepts and notations that we use
throughout the paper.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
CSAI '18, December 8–10, 2018, Shenzhen, China
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-6606-9/18/12…$15.00
DOI: https://doi.org/10.1145/3297156.3297262
下载后可阅读完整内容,剩余4页未读,立即下载
weixin_38693476
- 粉丝: 1
- 资源: 949
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
最新资源
- Flex垃圾回收与内存管理:防止内存泄露
- Python编程规范与最佳实践
- EJB3入门:实战教程与核心概念详解
- Python指南v2.6简体中文版——入门教程
- ANSYS单元类型详解:从Link1到Link11
- 深度解析C语言特性与实践应用
- Gentoo Linux安装与使用全面指南
- 牛津词典txt版:信息技术领域的便捷电子书
- VC++基础教程:从入门到精通
- CTO与程序员职业规划:能力提升与路径指南
- Google开放手机联盟与Android开发教程
- 探索Android触屏界面开发:从入门到设计原则
- Ajax实战:从理论到实践
- 探索Android应用开发:从入门到精通
- LM317T稳压管详解:1.5A可调输出,过载保护
- C语言实现SOCKET文件传输简单教程
资源上传下载、课程学习等过程中有任何疑问或建议,欢迎提出宝贵意见哦~我们会及时处理!
点击此处反馈
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功