Speech and Language Processing (3rd Edition): Neural Networks and Deep Learning Explained


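Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, co-authored by Professor Daniel Jurafsky of Stanford University and Professor James H. Martin of the University of Colorado Boulder, is a classic textbook now being revised for its third edition. The draft of September 23, 2018 substantially expands the coverage of natural language processing (NLP), with particular emphasis on neural network techniques such as recurrent neural networks (RNNs) and long short-term memory networks (LSTMs). This edition keeps the traditional material, including language models (such as n-gram models), naive Bayes classification, and sentiment analysis, and adds chapters on deep learning, so that readers can follow the progression from traditional statistical methods to modern deep learning models for natural language.

Chapter 7 covers neural networks and neural language models, explaining how these models capture semantic representations and how they connect to vector semantics (word vector representations), so that readers see how language is processed through multiple layers of nonlinear transformations. Chapter 9 presents recurrent neural networks for sequence processing, which matter for text modeling, machine translation, speech recognition, and other tasks that involve sequence data with temporal dependencies.

The book also covers the grammatical structure of natural language: formal grammars of English in Chapter 10, syntactic parsing in Chapter 11, and statistical and dependency parsing in Chapters 12 and 13. On the semantic side, Chapter 14 discusses the representation of sentence meaning, while Chapters 15 and 16 go on to computational semantics and semantic parsing, which are key to building systems that understand deeper meaning. Information extraction, semantic role labeling, and lexicons for sentiment, affect, and connotation are also covered, showing how to extract useful information from text and analyze sentiment. Finally, coreference resolution and entity linking, discourse coherence, and machine translation are treated in detail.

With its updated content and accessible explanations, the third edition draft offers a path through NLP from the fundamentals to current research that is useful to beginners and researchers alike.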

16 CHAPTER 2 • REGULAR EXPRESSIONS, TEXT NORMALIZATION, EDIT DISTANCE

errors comes up again and again in implementing speech and language processing systems. Reducing the overall error rate for an application thus involves two antagonistic efforts:

• Increasing precision (minimizing false positives)

• Increasing recall (minimizing false negatives)

2.1.4 A More Complex Example

Let’s try out a more significant example of the power of REs. Suppose we want to build an application to help a user buy a computer on the Web. The user might want “any machine with at least 6 GHz and 500 GB of disk space for less than $1000”. To do this kind of retrieval, we first need to be able to look for expressions like 6 GHz or 500 GB or Mac or $999.99. In the rest of this section we’ll work out some simple regular expressions for this task.

First, let’s complete our regular expression for prices. Here’s a regular expression for a dollar sign followed by a string of digits:

/$[0-9]+/

Note that the $ character has a different function here than the end-of-line function we discussed earlier. Most regular expression parsers are smart enough to realize that $ here doesn’t mean end-of-line. (As a thought experiment, think about how regex parsers might figure out the function of $ from the context.)

Now we just need to deal with fractions of dollars. We’ll add a decimal point and two digits afterwards:

/$[0-9]+\.[0-9][0-9]/

This pattern only allows $199.99 but not $199. We need to make the cents optional and to make sure we’re at a word boundary:

/(^|\W)$[0-9]+(\.[0-9][0-9])?\b/

One last catch! This pattern allows prices like $199999.99 which would be far too expensive! We need to limit the dollars:

/(^|\W)$[0-9]{0,3}(\.[0-9][0-9])?\b/
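The final price pattern can be tried out in Python. This is a minimal sketch, not the book's own code: the helper name `is_price` is hypothetical, and the literal dollar sign is escaped as `\$` because Python's `re` module treats an unescaped `$` as an end-of-string anchor. For simplicity the helper checks one candidate string at a time with `fullmatch` instead of scanning running text.

```python
import re

# Final price pattern from the text. The dollar sign is escaped because
# Python's re module always reads a bare $ as an anchor.
PRICE_RE = re.compile(r"(^|\W)\$[0-9]{0,3}(\.[0-9][0-9])?\b")


def is_price(candidate):
    """Return True if the whole candidate string is accepted by PRICE_RE."""
    return PRICE_RE.fullmatch(candidate) is not None


for candidate in ["$199.99", "$199", "$199999.99", "$.99"]:
    print(candidate, is_price(candidate))
```

Under these assumptions `$199.99` and `$199` are accepted and `$199999.99` is rejected. Because `{0,3}` permits zero digits, the pattern as written also accepts `$.99`.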

How about specifications for > 6GHz processor speed? Here’s a pattern for that:

/\b[6-9]+ *(GHz|[Gg]igahertz)\b/

Note that we use / */ to mean “zero or more spaces” since there might always be extra spaces lying around. For disk space, we’ll need to allow for optional fractions again (5.5 GB); note the use of ? for making the final s optional:

/\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/

Modifying this regular expression so that it only matches more than 500 GB is left as an exercise for the reader.
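The processor-speed and disk-space patterns above carry over to Python's `re` module unchanged. The following sketch runs both over a made-up product description; the function name `find_specs` and the sample text are assumptions for illustration only.

```python
import re

# Patterns copied from the text.
GHZ_RE = re.compile(r"\b[6-9]+ *(GHz|[Gg]igahertz)\b")
DISK_RE = re.compile(r"\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b")


def find_specs(text):
    """Collect the full matched strings for each kind of specification."""
    return {
        "speed": [m.group(0) for m in GHZ_RE.finditer(text)],
        "disk": [m.group(0) for m in DISK_RE.finditer(text)],
    }


ad = "Desktop with 6 GHz processor and 500 GB disk, or 5.5 gigabytes of cache"
specs = find_specs(ad)
print(specs)
# -> {'speed': ['6 GHz'], 'disk': ['500 GB', '5.5 gigabytes']}
```

`finditer` with `group(0)` is used instead of `findall` because `findall` would return only the contents of the parenthesized groups.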

2.1.5 More Operators

Figure 2.7 shows some aliases for common ranges, which can be used mainly to save typing. Besides the Kleene * and Kleene + we can also use explicit numbers as counters, by enclosing them in curly brackets. The regular expression /{3}/ means “exactly 3 occurrences of the previous character or expression”. So /a\.{24}z/ will match a followed by 24 dots followed by z (but not a followed by 23 or 25 dots followed by a z).
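The exact-count behavior of `{24}` can be checked directly. In this small sketch the helper `matches_with_dots` is a hypothetical name; it builds a string with a given number of dots and tests it against the whole pattern.

```python
import re

# a, then exactly 24 literal dots, then z.
DOTS_RE = re.compile(r"a\.{24}z")


def matches_with_dots(n_dots):
    """Return True if 'a' + n_dots dots + 'z' matches the whole pattern."""
    return DOTS_RE.fullmatch("a" + "." * n_dots + "z") is not None


for n in (23, 24, 25):
    print(n, matches_with_dots(n))
```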

RE   Expansion       Match                         First Matches
\d   [0-9]           any digit                     Party of 5
\D   [^0-9]          any non-digit                 Blue moon
\w   [a-zA-Z0-9_]    any alphanumeric/underscore   Daiyu
\W   [^\w]           a non-alphanumeric            !!!!
\s   [ \r\t\n\f]     whitespace (space, tab)
\S   [^\s]           Non-whitespace                in Concord

Figure 2.7 Aliases for common sets of characters.

A range of numbers can also be specified. So /{n,m}/ specifies from n to m occurrences of the previous char or expression, and /{n,}/ means at least n occurrences of the previous expression. REs for counting are summarized in Fig. 2.8.

RE      Match
*       zero or more occurrences of the previous char or expression
+       one or more occurrences of the previous char or expression
?       exactly zero or one occurrence of the previous char or expression
{n}     n occurrences of the previous char or expression
{n,m}   from n to m occurrences of the previous char or expression
{n,}    at least n occurrences of the previous char or expression
{,m}    up to m occurrences of the previous char or expression

Figure 2.8 Regular expression operators for counting.
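The aliases and counters in Figs. 2.7 and 2.8 are available in Python's `re` module with the same notation. The sketch below looks up the first match for several aliases on the example strings from Fig. 2.7 and then applies a `{n,m}` range; `first_match` is a hypothetical helper and the digit string is made up.

```python
import re


def first_match(pattern, text):
    """Return the first substring of text matched by pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


print(first_match(r"\d", "Party of 5"))   # -> 5
print(first_match(r"\D", "Blue moon"))    # -> B
print(first_match(r"\W", "!!!!"))         # -> !
print(first_match(r"\S", "in Concord"))   # -> i

# {2,4}: runs of two to four digits, matched greedily from left to right.
runs = re.findall(r"\d{2,4}", "7 42 123456")
print(runs)  # -> ['42', '1234', '56']
```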

Finally, certain special characters are referred to by special notation based on the backslash (\) (see Fig. 2.9). The most common of these are the newline character \n and the tab character \t. To refer to characters that are special themselves (like ., *, [, and \), precede them with a backslash (i.e., /\./, /\*/, /\[/, and /\\/).

RE   Match             First Patterns Matched
\*   an asterisk “*”   “K*A*P*L*A*N”
\.   a period “.”      “Dr. Livingston, I presume”
\?   a question mark   “Why don’t they come and lend a hand?”
\n   a newline
\t   a tab

Figure 2.9 Some characters that need to be backslashed.

2.1.6 Regular Expression Substitution, Capture Groups, and ELIZA

An important use of regular expressions is in substitutions. For example, the substitution operator s/regexp1/pattern/ used in Python and in Unix commands like vim or sed allows a string characterized by a regular expression to be replaced by another string:

s/colour/color/

It is often useful to be able to refer to a particular subpart of the string matching the first pattern. For example, suppose we wanted to put angle brackets around all integers in a text, for example, changing the 35 boxes to the <35> boxes. We’d like a way to refer to the integer we’ve found so that we can easily add the brackets. To do this, we put parentheses ( and ) around the first pattern and use the number operator \1 in the second pattern to refer back. Here’s how it looks:

s/([0-9]+)/<\1>/
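In Python, substitutions of this kind are written with `re.sub(pattern, replacement, string)` rather than the `s/.../.../` notation. The sketch below reproduces the two substitutions shown above; the function name `bracket_integers` is hypothetical.

```python
import re


def bracket_integers(text):
    """Wrap every run of digits in angle brackets using capture group 1."""
    return re.sub(r"([0-9]+)", r"<\1>", text)


print(bracket_integers("the 35 boxes"))            # -> the <35> boxes
print(re.sub(r"colour", "color", "colour chart"))  # -> color chart
```

Unlike `s/regexp1/pattern/` in sed, which replaces only the first match on a line unless the `g` flag is given, `re.sub` replaces every match by default.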

The parenthesis and number operators can also specify that a certain string or expression must occur twice in the text. For example, suppose we are looking for the pattern “the Xer they were, the Xer they will be”, where we want to constrain the two X’s to be the same string. We do this by surrounding the first X with the parenthesis operator, and replacing the second X with the number operator \1, as follows:

/the (.*)er they were, the \1er they will be/

Here the \1 will be replaced by whatever string matched the first item in parentheses. So this will match the bigger they were, the bigger they will be but not the bigger they were, the faster they will be.
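The constraint imposed by `\1` can be verified with the two sentences just mentioned. This is a small sketch; `same_x` is a hypothetical helper name.

```python
import re

# \1 must repeat exactly what the first parenthesized group matched.
XER_RE = re.compile(r"the (.*)er they were, the \1er they will be")


def same_x(sentence):
    """Return True if the sentence contains the pattern with both X's equal."""
    return XER_RE.search(sentence) is not None


print(same_x("the bigger they were, the bigger they will be"))  # -> True
print(same_x("the bigger they were, the faster they will be"))  # -> False
```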

This use of parentheses to store a pattern in memory is called a capture group. Every time a capture group is used (i.e., parentheses surround a pattern), the resulting match is stored in a numbered register. If you match two different sets of parentheses, \2 means whatever matched the second capture group. Thus

/the (.*)er they (.*), the \1er we \2/

will match the faster they ran, the faster we ran but not the faster they ran, the faster we ate. Similarly, the third capture group is stored in \3, the fourth is \4, and so on.
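In Python, the numbered registers are exposed on the match object through `group(1)`, `group(2)`, and so on. The following sketch inspects what each register holds for the two-group pattern above; the variable names are illustrative.

```python
import re

TWO_GROUPS_RE = re.compile(r"the (.*)er they (.*), the \1er we \2")

hit = TWO_GROUPS_RE.search("the faster they ran, the faster we ran")
miss = TWO_GROUPS_RE.search("the faster they ran, the faster we ate")

# group(1) and group(2) hold the contents of the first and second registers.
print(hit.group(1), hit.group(2))  # -> fast ran
print(miss)                        # -> None
```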

Parentheses thus have a double function in regular expressions; they are used to group terms for specifying the order in which operators should apply, and they are used to capture something in a register. Occasionally we might want to use parentheses for grouping, but don’t want to capture the resulting pattern in a register. In that case we use a non-capturing group, which is specified by putting the commands ?: after the open paren, in the form (?: pattern ).

/(?:some|a few) (people|cats) like some \1/

will match some cats like some cats but not some cats like some a few.
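The effect of the non-capturing group on register numbering can be seen by listing the captured groups. In this sketch only `(people|cats)` occupies a register, so `\1` refers to it and not to `(?:some|a few)`.

```python
import re

NONCAP_RE = re.compile(r"(?:some|a few) (people|cats) like some \1")

ok = NONCAP_RE.search("some cats like some cats")
bad = NONCAP_RE.search("some cats like some a few")

# groups() lists only capturing groups; the (?:...) group is not included.
print(ok.groups())  # -> ('cats',)
print(bad)          # -> None
```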

Substitutions and capture groups are very useful in implementing simple chatbots like ELIZA (Weizenbaum, 1966). Recall that ELIZA simulates a Rogerian psychologist by carrying on conversations like the following:

User: Men are all alike.
ELIZA: IN WHAT WAY
User: They’re always bugging us about something or other.
ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE
User: Well, my boyfriend made me come here.
ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
User: He says I’m depressed much of the time.
ELIZA: I AM SORRY TO HEAR YOU ARE DEPRESSED

ELIZA works by having a series or cascade of regular expression substitutions each of which matches and changes some part of the input lines. Input lines are first uppercased. The first substitutions then change all instances of MY to YOUR, and I’M to YOU ARE, and so on. The next set of substitutions matches and replaces other patterns in the input. Here are some examples:


s/.* I’M (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/

s/.* I AM (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/

s/.* all .*/IN WHAT WAY/

s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/

Since multiple substitutions can apply to a given input, substitutions are assigned a rank and applied in order. Creating patterns is the topic of Exercise 2.3, and we return to the details of the ELIZA architecture in Chapter 24.
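The ranked rules above can be sketched as an ordered list that is scanned until one rule matches. This is an illustration under several assumptions, not the ELIZA architecture itself: the patterns are written in uppercase to match the uppercased input, word boundaries (`\b`) stand in for the surrounding spaces, a straight apostrophe is used in I'M, the first pronoun-swapping stage is omitted, and the names `RULES` and `respond` are hypothetical.

```python
import re

# Ranked (pattern, response template) pairs; earlier rules take priority.
# Uppercase patterns and \b boundaries are adaptations made for this sketch.
RULES = [
    (re.compile(r".*\bI'M (DEPRESSED|SAD)\b.*"), r"I AM SORRY TO HEAR YOU ARE \1"),
    (re.compile(r".*\bI AM (DEPRESSED|SAD)\b.*"), r"WHY DO YOU THINK YOU ARE \1"),
    (re.compile(r".*\bALL\b.*"), "IN WHAT WAY"),
    (re.compile(r".*\bALWAYS\b.*"), "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]


def respond(line):
    """Uppercase the input and return the response of the first matching rule."""
    text = line.upper()
    for pattern, template in RULES:
        m = pattern.fullmatch(text)
        if m:
            # expand() fills \1 in the template from the capture group.
            return m.expand(template)
    return None


print(respond("Men are all alike."))
print(respond("He says I'm depressed much of the time."))
```

Returning after the first matching rule is one simple way to realize "assigned a rank and applied in order"; other orderings or scoring schemes are possible.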

2.1.7 Lookahead assertions

Finally, there will be times when we need to predict the future: look ahead in the text to see if some pattern matches, but not advance the match cursor, so that we can then deal with the pattern if it occurs.

These lookahead assertions make use of the (? syntax that we saw in the previous section for non-capture groups. The operator (?= pattern) is true if pattern occurs, but is zero-width, i.e. the match pointer doesn’t advance. The operator (?! pattern) only returns true if a pattern does not match, but again is zero-width and doesn’t advance the cursor. Negative lookahead is commonly used when we are parsing some complex pattern but want to rule out a special case. For example, suppose we want to match, at the beginning of a line, any single word that doesn’t start with “Volcano”. We can use negative lookahead to do this:

/^(?!Volcano)[A-Za-z]+/
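Python's `re` module supports the same `(?!...)` syntax. The sketch below applies the pattern to a few made-up lines; `first_word_unless_volcano` is a hypothetical helper name.

```python
import re

# Zero-width negative lookahead: fail if the line starts with "Volcano",
# otherwise consume the first run of letters.
NOT_VOLCANO_RE = re.compile(r"^(?!Volcano)[A-Za-z]+")


def first_word_unless_volcano(line):
    """Return the first word of line unless it starts with 'Volcano'."""
    m = NOT_VOLCANO_RE.match(line)
    return m.group(0) if m else None


print(first_word_unless_volcano("Mountains rise slowly"))  # -> Mountains
print(first_word_unless_volcano("Volcanoes erupt"))        # -> None
print(first_word_unless_volcano("Volcanic ash fell"))      # -> Volcanic
```

The last call succeeds because "Volcanic" does not begin with the full string "Volcano".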

2.2 Words

Before we talk about processing words, we need to decide what counts as a word. Let’s start by looking at one particular corpus (plural corpora), a computer-readable collection of text or speech. For example the Brown corpus is a million-word collection of samples from 500 written English texts from different genres (newspaper, fiction, non-fiction, academic, etc.), assembled at Brown University in 1963–64 (Kučera and Francis, 1967). How many words are in the following Brown sentence?

He stepped out into the hall, was delighted to encounter a water brother.

This sentence has 13 words if we don’t count punctuation marks as words, 15 if we count punctuation. Whether we treat period (“.”), comma (“,”), and so on as words depends on the task. Punctuation is critical for finding boundaries of things (commas, periods, colons) and for identifying some aspects of meaning (question marks, exclamation marks, quotation marks). For some tasks, like part-of-speech tagging or parsing or speech synthesis, we sometimes treat punctuation marks as if they were separate words.
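The two counts for the Brown sentence can be reproduced with a deliberately naive regular-expression tokenizer. This is only a sketch: treating every `\w+` run as a word and every other non-space character as a punctuation token is an assumption that happens to work for this sentence, not a general tokenization method.

```python
import re

sentence = "He stepped out into the hall, was delighted to encounter a water brother."

# Words only: runs of alphanumeric characters.
words = re.findall(r"\w+", sentence)

# Words plus punctuation: each non-space, non-word character is its own token.
words_and_punct = re.findall(r"\w+|[^\w\s]", sentence)

print(len(words))            # -> 13
print(len(words_and_punct))  # -> 15
```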

The Switchboard corpus of American English telephone conversations between strangers was collected in the early 1990s; it contains 2430 conversations averaging 6 minutes each, totaling 240 hours of speech and about 3 million words (Godfrey et al., 1992). Such corpora of spoken language don’t have punctuation but do introduce other complications with regard to defining words. Let’s look at one utterance from Switchboard; an utterance is the spoken correlate of a sentence:

I do uh main- mainly business data processing

This utterance has two kinds of disfluencies. The broken-off word main- is called a fragment. Words like uh and um are called fillers or filled pauses. Should we consider these to be words? Again, it depends on the application. If we are building a speech transcription system, we might want to eventually strip out the disfluencies.

But we also sometimes keep disfluencies around. Disfluencies like uh or um are actually helpful in speech recognition in predicting the upcoming word, because they may signal that the speaker is restarting the clause or idea, and so for speech recognition they are treated as regular words. Because people use different disfluencies they can also be a cue to speaker identification. In fact Clark and Fox Tree (2002) showed that uh and um have different meanings. What do you think they are?

Are capitalized tokens like They and uncapitalized tokens like they the same word? These are lumped together in some tasks (speech recognition), while for part-of-speech or named-entity tagging, capitalization is a useful feature and is retained.

How about inflected forms like cats versus cat? These two words have the same lemma cat but are different wordforms. A lemma is a set of lexical forms having the same stem, the same major part-of-speech, and the same word sense. The wordform is the full inflected or derived form of the word. For morphologically complex languages like Arabic, we often need to deal with lemmatization. For many tasks in English, however, wordforms are sufficient.

How many words are there in English? To answer this question we need to distinguish two ways of talking about words. Types are the number of distinct words in a corpus; if the set of words in the vocabulary is V, the number of types is the vocabulary size |V|. Tokens are the total number N of running words. If we ignore punctuation, the following Brown sentence has 16 tokens and 14 types:

They picnicked by the pool, then lay back on the grass and looked at the stars.

When we speak about the number of words in the language, we are generally referring to word types.
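The token and type counts for the Brown sentence above can be checked in a few lines. The sketch assumes a simple `\w+` tokenizer that ignores punctuation and keeps case distinctions (so They and the stay separate types); the variable names are illustrative.

```python
import re

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")

tokens = re.findall(r"\w+", sentence)  # running words, N
types = set(tokens)                    # distinct words, the vocabulary V

print(len(tokens))  # -> 16
print(len(types))   # -> 14
```

The difference of two comes from the word the, which occurs three times but counts as a single type.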

Corpus                                Tokens = N     Types = |V|
Shakespeare                           884 thousand   31 thousand
Brown corpus                          1 million      38 thousand
Switchboard telephone conversations   2.4 million    20 thousand
COCA                                  440 million    2 million
Google N-grams                        1 trillion     13 million

Figure 2.10 Rough numbers of types and tokens for some English language corpora. The largest, the Google N-grams corpus, contains 13 million types, but this count only includes types appearing 40 or more times, so the true number would be much larger.

Fig. 2.10 shows the rough numbers of types and tokens computed from some popular English corpora. The larger the corpora we look at, the more word types we find, and in fact this relationship between the number of types |V| and number of tokens N is called Herdan’s Law (Herdan, 1960) or Heaps’ Law (Heaps, 1978) after its discoverers (in linguistics and information retrieval respectively). It is shown in Eq. 2.1, where k and β are positive constants, and 0 < β < 1.

$$|V| = kN^{\beta} \tag{2.1}$$
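Eq. 2.1 can be evaluated numerically to see how vocabulary size grows with corpus size. The sketch below is illustrative only: the values of `K` and `BETA` are hypothetical placeholders rather than constants fitted to any corpus, and `herdan_vocab_size` is a made-up function name.

```python
def herdan_vocab_size(n_tokens, k, beta):
    """Vocabulary size |V| = k * N**beta predicted by Eq. 2.1."""
    if k <= 0 or not 0 < beta < 1:
        raise ValueError("k must be positive and beta must lie in (0, 1)")
    return k * n_tokens ** beta


# Hypothetical constants chosen only to illustrate the shape of the curve.
K, BETA = 30.0, 0.7

small = herdan_vocab_size(1_000_000, K, BETA)
large = herdan_vocab_size(2_000_000, K, BETA)

# Doubling N multiplies |V| by 2**beta, which is less than 2 since beta < 1.
growth = large / small
print(growth)
```

Because β is below 1, the vocabulary keeps growing as more text is added but more slowly than the number of tokens.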

The value of β depends on the corpus size and the genre, but at least for the large corpora in Fig. 2.10, β ranges from .67 to .75. Roughly then we can say that
