COLLOQUIUM PAPER | COMPUTER SCIENCES
Emergent linguistic structure in artificial neural
networks trained by self-supervision
Christopher D. Manning^{a,1}, Kevin Clark^a, John Hewitt^a, Urvashi Khandelwal^a, and Omer Levy^b
^a Computer Science Department, Stanford University, Stanford, CA 94305; and ^b Facebook Artificial Intelligence Research, Facebook Inc., Seattle, WA 98109
Edited by Matan Gavish, Hebrew University of Jerusalem, Jerusalem, Israel, and accepted by Editorial Board Member David L. Donoho April 13, 2020
(received for review June 3, 2019)
This paper explores the knowledge of linguistic structure learned
by large artificial neural networks, trained via self-supervision,
whereby the model simply tries to predict a masked word in a
given context. Human language communication is via sequences
of words, but language understanding requires constructing rich
hierarchical structures that are never observed explicitly. The
mechanisms for this have been a prime mystery of human
language acquisition, while engineering work has mainly pro-
ceeded by supervised learning on treebanks of sentences hand
labeled for this latent structure. However, we demonstrate that
modern deep contextual language models learn major aspects
of this structure, without any explicit supervision. We develop
methods for identifying linguistic hierarchical structure emer-
gent in artificial neural networks and demonstrate that com-
ponents in these models focus on syntactic grammatical rela-
tionships and anaphoric coreference. Indeed, we show that a
linear transformation of learned embeddings in these models
captures parse tree distances to a surprising degree, allowing
approximate reconstruction of the sentence tree structures nor-
mally assumed by linguists. These results help explain why these
models have brought such large improvements across many
language-understanding tasks.
artificial neural network | self-supervision | syntax | learning
Human language communication is via sequences of words,
canonically produced as a mainly continuous speech stream
(1). Behind this linear organization is a rich hierarchical language
structure with additional links (such as coreference between
mentions) that needs to be understood by a hearer (or reader).
In Fig. 1, for instance, a hearer has to understand a sentence
structure roughly like the one shown to realize that the chef was
out of food rather than the store.* Language understanding, like
vision, can be seen as an inverse problem (3), where the hearer
has to reconstruct structures and causes from the observed
surface form.
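To make this hidden structure concrete, the sketch below encodes a sentence of the kind discussed above as a list of head-word indices and walks from each word up to the root of the sentence. The sentence and the particular dependency analysis are our illustrative assumptions, not the exact annotation of Fig. 1. The paths show why the chef, and not the store, is understood as what was out of food.

    # Illustrative dependency analysis of a sentence like the one in Fig. 1.
    # The word list and head indices below are an assumption for illustration,
    # not the figure's exact annotation.
    words = ["The", "chef", "who", "ran", "to", "the", "store",
             "was", "out", "of", "food"]
    # heads[i] is the 1-indexed position of the syntactic head of word i+1;
    # 0 marks the root of the sentence ("was").
    heads = [2, 8, 4, 2, 4, 7, 5, 0, 8, 9, 10]

    def path_to_root(i):
        """Words on the path from position i (1-indexed) up to the root."""
        path = [words[i - 1]]
        while heads[i - 1] != 0:
            i = heads[i - 1]
            path.append(words[i - 1])
        return path

    # "chef" attaches directly to "was" as its subject, while "store" is
    # buried inside the relative clause "who ran to the store".
    print(path_to_root(2))  # ['chef', 'was']
    print(path_to_root(7))  # ['store', 'to', 'ran', 'chef', 'was']

None of this tree structure is visible in the linear word sequence itself; it is exactly what the hearer must reconstruct.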
In computational linguistics, the long dominant way of
addressing this structure induction problem has been to hand
design linguistic representations, broadly following proposals
from linguistics proper. Under one set of conventions, the sen-
tence in Fig. 1 would be annotated with the structure shown.
Humans then label many natural language sentences with their
underlying structure. Such datasets of annotated human lan-
guage structure, known as treebanks (4, 5), have fueled much
of the research in the field in the last 25 y. Researchers
train progressively better supervised machine-learning mod-
els on the treebank, which attempt to recover this structure
for any sentence (6–8). This approach has been very effec-
tive as an engineering solution, but beyond the high prac-
tical cost of human labeling, it gives no insight into how
children might approach structure induction from observed
data alone.
Recently, enormous progress has been made in natural lan-
guage representation learning by adopting a self-supervised
learning approach. In self-supervised learning, a system is given
no explicit labeling of raw data, but it is able to construct its
own supervised learning problems by choosing to interpret some
of the data as a “label” to be predicted.† The canonical case
for human language is the language-modeling task of trying
to predict the next word in an utterance based on the tempo-
rally preceding words (Fig. 2). Variant tasks include the masked
language-modeling task of predicting a masked word in a text
[a.k.a. the cloze task (11)] and predicting the words likely to
occur around a given word (12, 13). Autoencoders (14) can
also be thought of as self-supervised learning systems. Since no
explicit labeling of the data is required, self-supervised learning
is a type of unsupervised learning, but the approach of self-
generating supervised learning objectives differentiates it from
other unsupervised learning techniques such as clustering.
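As a concrete illustration of the masked-word (cloze) prediction task, the following minimal sketch queries a publicly released pretrained model. It assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint; it is a usage example only, not the training setup analyzed later in this paper.

    # Minimal sketch of masked (cloze-style) word prediction, assuming the
    # Hugging Face `transformers` library and the public bert-base-uncased
    # checkpoint; illustrative only, not the paper's training code.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    # The model must predict the word hidden behind [MASK] from its context.
    for candidate in fill_mask("The chef who ran to the store was out of [MASK]."):
        print(candidate["token_str"], round(candidate["score"], 3))

The "label" here is simply the withheld word itself, so arbitrarily large amounts of raw text can serve as training data.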
One might expect that a machine-learning model trained to
predict the next word in a text will just be a giant associa-
tional learning machine, with lots of statistics on how often the
word restaurant is followed by kitchen and perhaps some basic
abstracted sequence knowledge such as knowing that adjectives
are commonly followed by nouns in English. It is not at all clear
that such a system can develop interesting knowledge of the lin-
guistic structure of whatever human language the system is trained
on. Indeed, this has been the dominant perspective in linguis-
tics, where language models have long been seen as inadequate
and having no scientific interest, even when their usefulness in
practical engineering applications is grudgingly accepted (15, 16).
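The purely associational picture can itself be made concrete. The toy sketch below (our illustration, not a model from this paper) predicts the next word from bigram counts alone, with no representation of hierarchical structure; this is the kind of surface statistics a skeptic might expect a language model to reduce to.

    # Toy "associational learning machine": next-word prediction from bigram
    # counts alone, with no notion of hierarchical structure. Our illustration,
    # not a model from this paper.
    from collections import Counter, defaultdict

    corpus = "the chef ran to the store . the chef was out of food .".split()

    bigram_counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigram_counts[prev][nxt] += 1

    def predict_next(word):
        """Most frequent continuation of `word` in the toy corpus."""
        return bigram_counts[word].most_common(1)[0][0] if bigram_counts[word] else None

    print(predict_next("the"))  # 'chef' -- frequency, not understanding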
Starting in 2018, researchers in natural language process-
ing (NLP) built a new generation of much larger artificial
This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “The Science of Deep Learning,” held March 13–14, 2019, at the National Academy of Sciences in Washington, DC. NAS colloquia began in 1991 and have been published in PNAS since 1995. From February 2001 through May 2019 colloquia were supported by a generous gift from The Dame Jillian and Dr. Arthur M. Sackler Foundation for the Arts, Sciences, & Humanities, in memory of Dame Sackler’s husband, Arthur M. Sackler. The complete program and video recordings of most presentations are available on the NAS website at http://www.nasonline.org/science-of-deep-learning.
Author contributions: C.D.M., K.C., J.H., U.K., and O.L. designed research; K.C., J.H., and U.K. performed research; and C.D.M., K.C., J.H., U.K., and O.L. wrote the paper.
Competing interest statement: K.C. and U.K. have been/are employed part time at Google Inc., and K.C. has a Google PhD Fellowship. Researchers at Google Inc. developed the BERT model analyzed in this paper.
This article is a PNAS Direct Submission. M.G. is a guest editor invited by the Editorial Board.
Published under the PNAS license.
Data deposition: Code and most of the data to reproduce the analyses in this paper are freely available at https://github.com/clarkkev/attention-analysis and https://github.com/john-hewitt/structural-probes.
^1 To whom correspondence may be addressed. Email: manning@cs.stanford.edu.
*There are two main approaches to depicting a sentence’s syntactic structure: phrase
structure (or constituency) and dependency structure (or grammatical relations). The
former is dominant in modern linguistics, but in this paper we use the latter, which is
dominant in computational linguistics. Both representations capture similar, although
generally not identical, information (2).
†The approach of self-supervised learning has existed for decades, used particularly in
robotics, e.g., refs. 9 and 10, but it has recently been revived as a focus of interest, used
also for vision and language.