Flex (Fast Lexical Analyzer) User Manual, Built on Ubuntu

Updated 2024-07-18. 540 KB PDF.

"Flex(The Fast Lexical Analyzer)用户手册(pdf)" Flex是一个广泛使用的工具，用于创建词法分析器，也称为扫描器或词法解析器。它是一个开源项目，托管在GitHub上，主要用于处理正则表达式，将源代码转换为可执行的C代码，这个C代码能够识别输入文本中的特定模式。Flex特别适用于编译器构造、解析器生成和其他需要快速处理文本流的场景。本手册版本为2.6.4，由Vern Paxson、Will Estes和John Millaway编写，其版权遵循与Flex项目相同的许可条件，允许自由分发和修改，但需保留原始的版权信息和免责声明。这个手册的生成在Ubuntu环境下完成，因为Windows平台可能存在一些构建上的困难。 Flex的工作原理是基于用户定义的规则集，这些规则描述了如何识别和处理输入文本中的不同模式。例如，一个简单的规则可能是匹配数字序列，并将它们作为整数处理。用户通过编写lex规格文件（通常命名为`.l`或`.ll`）来定义这些规则，文件中包含正则表达式和相应的C代码块。在lex规格文件中，每个规则由一个正则表达式和一组动作组成。当Flex扫描器在输入中找到匹配的模式时，它会执行相应规则的动作。动作可以是输出某个字符串、调用用户定义的函数，或者进行更复杂的操作。Flex通过将规格文件转换为C代码，然后编译生成词法分析器，这个分析器可以在运行时动态识别和处理输入文本。使用Flex有几个关键概念： 1. **模式匹配**：Flex通过正则表达式识别文本模式。正则表达式是一种强大的模式匹配语言，允许用户精确地指定想要查找的文本结构。 2. **优先级和冲突解决**：如果存在多个规则可以匹配同一个输入，Flex会根据规则的优先级来决定执行哪个。通常，较长的模式优先级更高。 3. **开始状态和结束状态**：Flex支持开始状态和结束状态的概念，允许用户在不同的上下文中定义不同的词法规则。 4. **缓冲区管理**：Flex内部维护了一个缓冲区，用于存储输入文本，这使得它可以回溯和重新匹配文本。 5. **用户自定义函数**：用户可以在规格文件中插入C代码，定义在匹配特定模式时需要执行的操作。 6. **错误处理**：Flex提供了一些内置机制来处理无法匹配的输入和错误情况，用户可以通过定义错误处理函数来定制错误处理行为。使用Flex生成的词法分析器常与Yacc（Yet Another Compiler-Compiler）配合使用，Yacc是一个语法分析器生成器。两者结合，可以构建完整的解析器，用于处理高级的编程语言或特定的文本格式。在实际应用中，Flex不仅限于编译器构造，还可以用于解析日志文件、处理配置文件、文本分析等多种任务。它的灵活性和高效性使其成为开发工具箱中的一个重要组成部分。对于需要处理大量文本数据的开发者来说，理解和掌握Flex的使用是非常有价值的技能。

Chapter 6: Patterns 9

6 Patterns

The patterns in the input (see Section 5.2 [Rules Section], page 7) are written using an extended set of regular expressions. These are:

‘x’ match the character ‘x’

‘.’ any character (byte) except newline

‘[xyz]’ a character class; in this case, the pattern matches either an ‘x’, a ‘y’, or a ‘z’

‘[abj-oZ]’ a "character class" with a range in it; matches an ‘a’, a ‘b’, any letter from ‘j’ through ‘o’, or a ‘Z’

‘[^A-Z]’ a "negated character class", i.e., any character but those in the class. In this case, any character EXCEPT an uppercase letter.

‘[^A-Z\n]’ any character EXCEPT an uppercase letter or a newline

‘[a-z]{-}[aeiou]’ the lowercase consonants

‘r*’ zero or more r’s, where r is any regular expression

‘r+’ one or more r’s

‘r?’ zero or one r’s (that is, “an optional r”)

‘r{2,5}’ anywhere from two to five r’s

‘r{2,}’ two or more r’s

‘r{4}’ exactly 4 r’s

‘{name}’ the expansion of the ‘name’ definition (see Chapter 5 [Format], page 6).

‘"[xyz]\"foo"’ the literal string: ‘[xyz]"foo’

‘\X’ if X is ‘a’, ‘b’, ‘f’, ‘n’, ‘r’, ‘t’, or ‘v’, then the ANSI-C interpretation of ‘\x’. Otherwise, a literal ‘X’ (used to escape operators such as ‘*’)

‘\0’ a NUL character (ASCII code 0)

‘\123’ the character with octal value 123

‘\x2a’ the character with hexadecimal value 2a

‘(r)’ match an ‘r’; parentheses are used to override precedence (see below)

‘(?r-s:pattern)’ apply option ‘r’ and omit option ‘s’ while interpreting pattern. Options may be zero or more of the characters ‘i’, ‘s’, or ‘x’.

‘i’ means case-insensitive. ‘-i’ means case-sensitive.

‘s’ alters the meaning of the ‘.’ syntax to match any single byte whatsoever. ‘-s’ alters the meaning of ‘.’ to match any byte except ‘\n’.


‘x’ ignores comments and whitespace in patterns. Whitespace is ignored unless it is backslash-escaped, contained within ‘""’s, or appears inside a character class.

The following are all valid:

    (?:foo)         same as  (foo)
    (?i:ab7)        same as  ([aA][bB]7)
    (?-i:ab)        same as  (ab)
    (?s:.)          same as  [\x00-\xFF]
    (?-s:.)         same as  [^\n]
    (?ix-s: a . b)  same as  ([Aa][^\n][bB])
    (?x:a b)        same as  ("ab")
    (?x:a\ b)       same as  ("a b")
    (?x:a" "b)      same as  ("a b")
    (?x:a[ ]b)      same as  ("a b")
    (?x:a
        /* comment */
        b
        c)          same as  (abc)

‘(?# comment )’ omit everything within ‘()’. The first ‘)’ character encountered ends the pattern. It is not possible for the comment to contain a ‘)’ character. The comment may span lines.

‘rs’ the regular expression ‘r’ followed by the regular expression ‘s’; called concatenation

‘r|s’ either an ‘r’ or an ‘s’

‘r/s’ an ‘r’ but only if it is followed by an ‘s’. The text matched by ‘s’ is included when determining whether this rule is the longest match, but is then returned to the input before the action is executed. So the action only sees the text matched by ‘r’. This type of pattern is called trailing context. (There are some combinations of ‘r/s’ that flex cannot match correctly. See Chapter 24 [Limitations], page 74, regarding dangerous trailing context.)

‘^r’ an ‘r’, but only at the beginning of a line (i.e., when just starting to scan, or right after a newline has been scanned).

‘r$’ an ‘r’, but only at the end of a line (i.e., just before a newline). Equivalent to ‘r/\n’.

Note that flex’s notion of “newline” is exactly whatever the C compiler used to compile flex interprets ‘\n’ as; in particular, on some DOS systems you must either filter out ‘\r’s in the input yourself, or explicitly use ‘r/\r\n’ for ‘r$’.

‘<s>r’ an ‘r’, but only in start condition s (see Chapter 10 [Start Conditions], page 21 for discussion of start conditions).

‘<s1,s2,s3>r’ same, but in any of start conditions s1, s2, or s3.


‘<*>r’ an ‘r’ in any start condition, even an exclusive one.

‘<<EOF>>’ an end-of-file.

‘<s1,s2><<EOF>>’ an end-of-file when in start condition s1 or s2
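The trailing-context entry (‘r/s’) in the list above separates two lengths: the text of both ‘r’ and ‘s’ counts when rules compete for the longest match, but only the text of ‘r’ is handed to the action. The following C sketch illustrates that bookkeeping for literal strings only. The struct `tc_match` and the function `match_trailing_context` are hypothetical names invented for this illustration; they are not part of flex, and real flex patterns are regular expressions rather than fixed strings.

```c
#include <stddef.h>
#include <string.h>

/* Result of trying a rule of the form r/s at the current input position. */
struct tc_match {
    size_t competing_len; /* length of r plus s: used when comparing rules for longest match */
    size_t token_len;     /* length of r alone: the text the action gets to see */
};

/*
 * Sketch for literal r and s only.
 * Returns {0, 0} when the input does not start with r followed by s.
 */
struct tc_match match_trailing_context(const char *input, const char *r, const char *s)
{
    struct tc_match m = { 0, 0 };
    size_t rlen = strlen(r);
    size_t slen = strlen(s);

    /* The second comparison runs only if r matched, so input + rlen is in bounds. */
    if (strncmp(input, r, rlen) == 0 && strncmp(input + rlen, s, slen) == 0) {
        m.competing_len = rlen + slen;
        m.token_len = rlen; /* the s part stays in the input for the next match */
    }
    return m;
}
```

Calling it with `r = "end"` and `s = "\n"` mimics the ‘r$’ form, which the list describes as equivalent to ‘r/\n’.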

Note that inside of a character class, all regular expression operators lose their special meaning except escape (‘\’) and the character class operators, ‘-’, ‘]]’, and, at the beginning of the class, ‘^’.
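The ‘[a-z]{-}[aeiou]’ entry in the list above treats character classes as sets and subtracts one from the other. A minimal C sketch of that idea follows, modelling a class as a 256-entry membership table. The type `cclass` and the helper functions are assumptions introduced for this illustration and do not reflect how flex stores classes internally.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A character class as a membership table indexed by byte value. */
struct cclass {
    bool has[256];
};

/* Build a class from inclusive ranges given as pairs: "az09" means [a-z0-9]. */
struct cclass cclass_from_ranges(const char *pairs)
{
    struct cclass c;

    memset(&c, 0, sizeof c);
    for (size_t i = 0; pairs[i] != '\0' && pairs[i + 1] != '\0'; i += 2) {
        int lo = (unsigned char)pairs[i];
        int hi = (unsigned char)pairs[i + 1];
        for (int ch = lo; ch <= hi; ch++)
            c.has[ch] = true;
    }
    return c;
}

/* Set difference: members of a that are not members of b. */
struct cclass cclass_diff(struct cclass a, struct cclass b)
{
    for (int ch = 0; ch < 256; ch++)
        a.has[ch] = a.has[ch] && !b.has[ch];
    return a;
}

/* Number of bytes in the class; zero means the class can never match. */
int cclass_count(const struct cclass *c)
{
    int n = 0;

    for (int ch = 0; ch < 256; ch++)
        if (c->has[ch])
            n++;
    return n;
}
```

For example, subtracting `cclass_from_ranges("aaeeiioouu")` from `cclass_from_ranges("az")` leaves the 21 lowercase consonants on an ASCII system.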

The regular expressions listed above are grouped according to precedence, from highest precedence at the top to lowest at the bottom. Those grouped together have equal precedence (see special note on the precedence of the repeat operator, ‘{}’, under the documentation for the ‘--posix’ POSIX compliance option). For example,

foo|bar*

is the same as

(foo)|(ba(r*))

since the ‘*’ operator has higher precedence than concatenation, and concatenation higher than alternation (‘|’). This pattern therefore matches either the string ‘foo’ or the string ‘ba’ followed by zero-or-more ‘r’s. To match ‘foo’ or zero-or-more repetitions of the string ‘bar’, use:

foo|(bar)*

And to match a sequence of zero or more repetitions of ‘foo’ and ‘bar’:

(foo|bar)*
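The difference between the three groupings above can be checked with hand-written matchers. The C sketch below is only an illustration: each function tests whether an entire string belongs to the language of one pattern, and the function names are invented for this example. It does not use flex itself.

```c
#include <stdbool.h>
#include <string.h>

/* foo|bar*  which groups as (foo)|(ba(r*)) */
bool match_foo_or_ba_rstar(const char *s)
{
    if (strcmp(s, "foo") == 0)
        return true;
    if (strncmp(s, "ba", 2) != 0)
        return false;
    s += 2;
    while (*s == 'r')   /* zero or more 'r' */
        s++;
    return *s == '\0';
}

/* foo|(bar)* */
bool match_foo_or_barstar(const char *s)
{
    if (strcmp(s, "foo") == 0)
        return true;
    while (strncmp(s, "bar", 3) == 0)   /* zero or more "bar" */
        s += 3;
    return *s == '\0';
}

/* (foo|bar)* */
bool match_foo_bar_star(const char *s)
{
    while (strncmp(s, "foo", 3) == 0 || strncmp(s, "bar", 3) == 0)
        s += 3;
    return *s == '\0';
}
```

Under these definitions `barrr` is accepted only by the first matcher, `barbar` by the second and third, and `foobar` only by the third.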

In addition to characters and ranges of characters, character classes can also contain character class expressions. These are expressions enclosed inside ‘[:’ and ‘:]’ delimiters (which themselves must appear between the ‘[’ and ‘]’ of the character class. Other elements may occur inside the character class, too). The valid expressions are:

[:alnum:] [:alpha:] [:blank:]

[:cntrl:] [:digit:] [:graph:]

[:lower:] [:print:] [:punct:]

[:space:] [:upper:] [:xdigit:]

These expressions all designate a set of characters equivalent to the corresponding standard C isXXX function. For example, ‘[:alnum:]’ designates those characters for which isalnum() returns true - i.e., any alphabetic or numeric character. Some systems don’t provide isblank(), so flex defines ‘[:blank:]’ as a blank or a tab.

For example, the following character classes are all equivalent:

[[:alnum:]]

[[:alpha:][:digit:]]

[[:alpha:][0-9]]

[a-zA-Z0-9]
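Since ‘[:alnum:]’ is defined in terms of isalnum(), the equivalence with ‘[a-zA-Z0-9]’ listed above can be checked directly in C. The sketch below compares the two for every byte value, with the character-type locale pinned to "C". The function names are made up for this illustration, and the explicit range test assumes an ASCII-compatible character set.

```c
#include <ctype.h>
#include <locale.h>
#include <stdbool.h>

/* True if byte c is in the explicit class [a-zA-Z0-9] (ASCII assumed). */
static bool in_explicit_alnum(int c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
}

/*
 * Count the byte values on which isalnum() and [a-zA-Z0-9] disagree.
 * Note: this changes the process-wide LC_CTYPE locale to "C".
 */
int count_alnum_mismatches(void)
{
    int mismatches = 0;

    setlocale(LC_CTYPE, "C");
    for (int c = 0; c < 256; c++) {
        if ((isalnum(c) != 0) != in_explicit_alnum(c))
            mismatches++;
    }
    return mismatches;
}
```

In the "C" locale on an ASCII system the count is zero. In another locale isalnum() may accept further bytes, in which case the classes would no longer coincide.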

A word of caution. Character classes are expanded immediately when seen in the flex input. This means the character classes are sensitive to the locale in which flex is executed, and the resulting scanner will not be sensitive to the runtime locale. This may or may not be desirable.


• If your scanner is case-insensitive (the ‘-i’ flag), then ‘[:upper:]’ and ‘[:lower:]’ are equivalent to ‘[:alpha:]’.

• Character classes with ranges, such as ‘[a-Z]’, should be used with caution in a case-insensitive scanner if the range spans upper or lowercase characters. Flex does not know if you want to fold all upper and lowercase characters together, or if you want the literal numeric range specified (with no case folding). When in doubt, flex will assume that you meant the literal numeric range, and will issue a warning. The exception to this rule is a character range such as ‘[a-z]’ or ‘[S-W]’ where it is obvious that you want case-folding to occur. Here are some examples with the ‘-i’ flag enabled:

    Range     Result      Literal Range        Alternate Range
    ‘[a-t]’   ok          ‘[a-tA-T]’
    ‘[A-T]’   ok          ‘[a-tA-T]’
    ‘[A-t]’   ambiguous   ‘[A-Z\[\\\]_`a-t]’   ‘[a-tA-T]’
    ‘[_-{]’   ambiguous   ‘[_`a-z{]’           ‘[_`a-zA-Z{]’
    ‘[@-C]’   ambiguous   ‘[@ABC]’             ‘[@A-Z\[\\\]_`abc]’

• A negated character class such as the example ‘[^A-Z]’ above will match a newline unless ‘\n’ (or an equivalent escape sequence) is one of the characters explicitly present in the negated character class (e.g., ‘[^A-Z\n]’). This is unlike how many other regular expression tools treat negated character classes, but unfortunately the inconsistency is historically entrenched. Matching newlines means that a pattern like ‘[^"]*’ can match the entire input unless there’s another quote in the input.

• Flex allows negation of character class expressions by prepending ‘^’ to the POSIX character class name.

[:^alnum:] [:^alpha:] [:^blank:]

[:^cntrl:] [:^digit:] [:^graph:]

[:^lower:] [:^print:] [:^punct:]

[:^space:] [:^upper:] [:^xdigit:]

Flex will issue a warning if the expressions ‘[:^upper:]’ and ‘[:^lower:]’ appear in a case-insensitive scanner, since their meaning is unclear. The current behavior is to skip them entirely, but this may change without notice in future revisions of flex.

• The ‘{-}’ operator computes the difference of two character classes. For example, ‘[a-c]{-}[b-z]’ represents all the characters in the class ‘[a-c]’ that are not in the class ‘[b-z]’ (which in this case, is just the single character ‘a’). The ‘{-}’ operator is left associative, so ‘[abc]{-}[b]{-}[c]’ is the same as ‘[a]’. Be careful not to accidentally create an empty set, which will never match.

• The ‘{+}’ operator computes the union of two character classes. For example, ‘[a-z]{+}[0-9]’ is the same as ‘[a-z0-9]’. This operator is useful when preceded by the result of a difference operation, as in, ‘[[:alpha:]]{-}[[:lower:]]{+}[q]’, which is equivalent to ‘[A-Zq]’ in the "C" locale.

• A rule can have at most one instance of trailing context (the ‘/’ operator or the ‘$’ operator). The start condition, ‘^’, and ‘<<EOF>>’ patterns can only occur at the beginning of a pattern, and, as well as with ‘/’ and ‘$’, cannot be grouped inside parentheses. A ‘^’ which does not occur at the beginning of a rule or a ‘$’ which does


