Perl正则表达式陷阱与深入解析

需积分: 4 27 浏览量更新于2024-12-24 收藏 207KB PDF 举报

Perl正则表达式是强大的文本处理工具，它们在SAS Version 9及以上版本中通过PRX家族函数和调用方法得以集成。这门技术虽然强大但因其复杂性常被用户视为“加密”的字符功能。本文旨在扩展现有的Perl正则表达式教程，着重解决编写过程中可能遇到的一些陷阱和微妙之处。首先，引入了Perl正则表达式的概念，这些模式能够用来定义一般的文本特征，以便在后续的匹配和文本操作中广泛应用。例如，"word boundary"（单词边界）允许你在单词的开始或结束位置定位；"negative lookaheads"（否定前瞻）用于排除特定模式之前的内容；"positive lookbehinds"（正向回顾）则在查找之前查找模式；"zero-width assertions"（零宽度断言）确保了在某些特定位置执行匹配，而不会消耗字符。论文以一个虚构的临床试验不良事件数据集为例，深入探讨了如非贪婪量词（non-greedy quantifiers）的概念，它们避免了在重复匹配时过度匹配。非贪婪模式在找到第一个匹配项后立即停止，而非试图找到尽可能多的匹配。"Anchors"（锚点）则是固定位置的匹配，例如`^`用于匹配字符串开头，`$`用于匹配结尾。此外，论文还提到了"non-capturing buffers"（非捕获组），它们是用于分组但不存储结果的括号，这对于复杂的模式结构很有帮助，提高了代码的可读性。这些技巧对于那些已经掌握了基本正则表达式知识的读者来说，将是一次提升和深化理解的宝贵机会。 Perl正则表达式的使用并非易事，但通过理解和掌握文中提到的这些关键概念，用户可以更有效地处理各种文本数据，并避免在实际应用中陷入潜在的陷阱。学习如何正确利用零宽度断言、锚点、非贪婪量词和非捕获组，能极大地提高编程效率和正则表达式的灵活性。因此，无论你是初学者还是经验丰富的开发者，这篇文章都将为你的Perl正则表达式之旅提供有价值的补充指导。

Before describing the regular expression needed for the task, some general comments about the way the PRX functions are

used in this example used are in order. In the code in Figure 2A, the regular expression is compiled only once at the first

iteration of the DATA step using the PRXPARSE function. The value assigned to the variable _re by the compilation of

the regular expression is retained and available for future use as an argument to the PRXMATCH function. Checking for

errors in the regular expression is also done at this juncture. If the regular expression is not syntactically correct, a missing

value will be assigned to variable _re. If this occurs, a polite reminder is written to the SAS log and the DATA step will

then terminate. The compile once behavior can also be accomplished using the /o modifier under most conditions. Future

examples will have the regex entered directly in the PRXMATCH function located in the WHERE clause of PROC SQL,

where one-time compilation of the regex occurs automatically. Generally speaking, one-time compilation of the regex

occurs when used in WHERE clauses of any PROC or DATA step. However, you lose the ability to do any error checking

on the regex.

Comments on the regex defined in the PRXPARSE function are included in Figure 2A. The variable match_pos contains

the starting position in the variable where the first regex match begins. If the regex does not match the variable in the

second argument in the PRXMATCH function, then match_pos is assigned the value of 0. The field OcularYN is a

Boolean field that has a value of 1 if the observation is flagged as an ocular adverse event (i.e. match_pos greater than

zero). Displayed in Figure 2B below are some observations from the Find_Ocular_AE data set. Notice that we were

indeed successful flagging observation #4 for the

OS part of the string and not the ou part of ‘Dangerous’, which is

confirmed by noting the match starts at position 20 and not 7.

Figure 2B – Partial Results from Ocular AE Matching

Obs Adverse Event Text match_pos OcularYN

Dangerous glaucoma OS 20 1

migraine headache 0 0

lower backache 0 0

festering headwound 0 0

O.D. has issues 1 1

The ‘ou’ from

‘headwound’

does not

match the

definition of

an ocular

adverse event

Now consider the regular expression in Figure 2C below where the metacharacter \b is replaced by the character class \s

Is the regular expression below equivalent to the one in Figure 2A?

Figure 2C – \b is a Zero-Width Assertion

_re=prxparse('/\s[o0]\.?[uds]\.?/i');

The character class \s

consumes at least one byte

of space, but \b does not.

Obs Adverse Event Text match_pos OcularYN

O.D. has issues 0 0

Inspecting the values of the fields in observation #8 we find that the two regexen are not equivalent. The reason is that \s

consumes at least one byte of the character field. The string O.D. is in the first four bytes of the field, so the leading space

character is not matched. The metacharacter \b does not consume any bytes, and in Perl this is referred to as a zero-width

assertion. This supermodel-like phenomenon will show up again throughout this paper.

To refresh your memory, \s is the Whitespace character class which includes space, tab, carriage return character and some

other peculiar beasts.

SAS Global Forum 2007

Posters

剩余10页未读，继续阅读

lengmianfo

粉丝: 1
资源: 2

Perl正则表达式陷阱与深入解析

perl regular expressions

An Introduction to Perl Regular Expressions in SAS

PERL REGULAR EXPRESSIONS QUICK START

mastering regular expressions 3rd 中文版

c语言 解析正则表达式

pcre2-10.23-2.el7.x86_64包

alert http any any -> any any (msg:"疑似SSRF攻击"; http.uri; content: "="; pcre:"/=file|=http|=https|=dict|=gopher|=phar/i"; classtype: web-ssrf-attack; sid: 561006; rev: 1;)这里面的pcre是什么意思

pcre limits exceeded

nginx 正则表达式

doris 报错 hs_compile regex pattern error:Embedded start anchors not supported.

最新资源

c语言解析正则表达式