As a special case, $\langle A \rangle_i^1$ represents the $i$th column of matrix $A$. For a vector $v$, we denote by $[v]_i$ the scalar at index $i$ in the vector. Finally, a sequence of elements $\{x_1, x_2, \dots, x_T\}$ is written $[x]_1^T$. The $i$th element of the sequence is $[x]_i$.
3.2 Transforming Words into Feature Vectors
One of the key points of our architecture is its ability to perform well with the use of (almost⁸) raw words. The ability of our method to learn good word representations is thus crucial to our approach. For efficiency, words are fed to our architecture as indices taken from a finite dictionary $\mathcal{D}$. Obviously, a simple index does not carry much useful information about the word. However, the first layer of our network maps each of these word indices into a feature vector, by a lookup table operation. Given a task of interest, a relevant representation of each word is then given by the corresponding lookup table feature vector, which is trained by backpropagation, starting from a random initialization.⁹ We will see in Section 4 that we can learn very good word representations from unlabeled corpora. Our architecture allows us to take advantage of better trained word representations, by simply initializing the word lookup table with these representations (instead of randomly).
More formally, for each word $w \in \mathcal{D}$, an internal $d_{wrd}$-dimensional feature vector representation is given by the lookup table layer $LT_W(\cdot)$:
$$LT_W(w) = \langle W \rangle_w^1 \,,$$
where $W \in \mathbb{R}^{d_{wrd} \times |\mathcal{D}|}$ is a matrix of parameters to be learned, $\langle W \rangle_w^1 \in \mathbb{R}^{d_{wrd}}$ is the $w$th column of $W$ and $d_{wrd}$ is the word vector size (a hyper-parameter to be chosen by the user). Given a sentence or any sequence of $T$ words $[w]_1^T$ in $\mathcal{D}$, the lookup table layer applies the same operation for each word in the sequence, producing the following output matrix:
$$LT_W([w]_1^T) = \begin{pmatrix} \langle W \rangle^1_{[w]_1} & \langle W \rangle^1_{[w]_2} & \dots & \langle W \rangle^1_{[w]_T} \end{pmatrix}. \qquad (1)$$
This matrix can then be fed to further neural network layers, as we will see below.
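To make the lookup operation concrete, the following minimal sketch implements Eq. (1) with numpy. The toy dictionary, the value of $d_{wrd}$, and the initialization scale are illustrative assumptions; in the actual system, $W$ is a parameter matrix trained by backpropagation.

import numpy as np

# Minimal sketch of the lookup table layer LT_W (Eq. 1).
# The dictionary, d_wrd, and the initialization scale are illustrative;
# in the real system W is trained by backpropagation.
rng = np.random.default_rng(0)

dictionary = {"the": 0, "cat": 1, "sat": 2, "down": 3}    # word -> index in D
d_wrd = 5                                                 # word vector size (hyper-parameter)
W = 0.01 * rng.standard_normal((d_wrd, len(dictionary)))  # parameters, d_wrd x |D|

def lookup_table(word_indices):
    # LT_W([w]_1^T): stack the columns <W>^1_{[w]_t}, one per word in the sequence.
    return W[:, word_indices]                             # output matrix, d_wrd x T

sentence = [dictionary[w] for w in ["the", "cat", "sat", "down"]]
print(lookup_table(sentence).shape)                       # (5, 4), i.e. d_wrd x T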
3.2.1 EXTENDING TO ANY DISCRETE FEATURES
One might want to provide features other than words if one suspects that these features are helpful
for the task of interest. For example, for the NER task, one could provide a feature which says if a
word is in a gazetteer or not. Another common practice is to introduce some basic pre-processing,
such as word-stemming or dealing with upper and lower case. In this latter option, the word would then be represented by three discrete features: its lower case stemmed root, its lower case ending, and a capitalization feature.
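As an illustration of this decomposition, the sketch below maps a word to a (stemmed root, ending, capitalization) triple. The crude suffix stripping stands in for a real stemmer, and the exact feature encodings are assumptions, since this section does not specify them.

def word_to_discrete_features(word):
    # Illustrative only: a real system would use a proper stemmer and
    # a fixed inventory of capitalization tags.
    if word.isupper():
        caps = "all-caps"
    elif word[:1].isupper():
        caps = "initial-cap"
    else:
        caps = "lowercase"
    lower = word.lower()
    for suffix in ("ing", "ed", "ly", "s"):   # toy suffix list
        if lower.endswith(suffix) and len(lower) > len(suffix) + 2:
            return lower[: -len(suffix)], suffix, caps
    return lower, "", caps

print(word_to_discrete_features("Running"))   # ('runn', 'ing', 'initial-cap')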
Generally speaking, we can consider a word as represented by $K$ discrete features $w \in \mathcal{D}^1 \times \dots \times \mathcal{D}^K$, where $\mathcal{D}^k$ is the dictionary for the $k$th feature. We associate to each feature a lookup table $LT_{W^k}(\cdot)$, with parameters $W^k \in \mathbb{R}^{d_{wrd}^k \times |\mathcal{D}^k|}$ where $d_{wrd}^k \in \mathbb{N}$ is a user-specified vector size. Given a
8. We did some pre-processing, namely lowercasing and encoding capitalization as another feature. With enough (unlabeled) training data, presumably we could learn a model without this processing. Ideally, an even more raw input would be to learn from letter sequences rather than words; however, we felt that this was beyond the scope of this work.
9. As for any other neural network layer.
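Extending the earlier sketch, one lookup table per discrete feature can be kept and their per-feature outputs combined; concatenating them into a single vector of size $\sum_k d_{wrd}^k$ is one natural combination. The dictionaries, the sizes $d_{wrd}^k$, and the concatenation rule below are assumptions for illustration, since the combination is described after this excerpt.

import numpy as np

rng = np.random.default_rng(0)

# One toy dictionary and one lookup table per discrete feature (K = 3 here).
# Dictionaries, per-feature sizes d^k_wrd, and the concatenation are illustrative.
feature_dicts = [
    {"run": 0, "walk": 1, "sit": 2},           # D^1: lower case stemmed root
    {"": 0, "ing": 1, "ed": 2, "s": 3},        # D^2: lower case ending
    {"lowercase": 0, "initial-cap": 1},        # D^3: capitalization feature
]
d_k = [4, 2, 2]                                # per-feature vector sizes d^k_wrd
tables = [0.01 * rng.standard_normal((d, len(D))) for d, D in zip(d_k, feature_dicts)]

def lookup_features(feature_indices):
    # Apply LT_{W^k} to the k-th feature index and concatenate the results.
    return np.concatenate([Wk[:, i] for Wk, i in zip(tables, feature_indices)])

# Example: the triple ("run", "ing", "initial-cap") -> indices (0, 1, 1).
vec = lookup_features((0, 1, 1))
print(vec.shape)                               # (8,) = sum_k d^k_wrd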