COLLOQUIUM PAPER | COMPUTER SCIENCES
Emergent linguistic structure in artificial neural
networks trained by self-supervision
Christopher D. Manning^{a,1}, Kevin Clark^a, John Hewitt^a, Urvashi Khandelwal^a, and Omer Levy^b
^a Computer Science Department, Stanford University, Stanford, CA 94305; and ^b Facebook Artificial Intelligence Research, Facebook Inc., Seattle, WA 98109
Edited by Matan Gavish, Hebrew University of Jerusalem, Jerusalem, Israel, and accepted by Editorial Board Member David L. Donoho April 13, 2020
(received for review June 3, 2019)
This paper explores the knowledge of linguistic structure learned
by large artificial neural networks, trained via self-supervision,
whereby the model simply tries to predict a masked word in a
given context. Human language communication is via sequences
of words, but language understanding requires constructing rich
hierarchical structures that are never observed explicitly. The
mechanisms for this have been a prime mystery of human
language acquisition, while engineering work has mainly pro-
ceeded by supervised learning on treebanks of sentences hand
labeled for this latent structure. However, we demonstrate that
modern deep contextual language models learn major aspects
of this structure, without any explicit supervision. We develop
methods for identifying linguistic hierarchical structure emer-
gent in artificial neural networks and demonstrate that com-
ponents in these models focus on syntactic grammatical rela-
tionships and anaphoric coreference. Indeed, we show that a
linear transformation of learned embeddings in these models
captures parse tree distances to a surprising degree, allowing
approximate reconstruction of the sentence tree structures nor-
mally assumed by linguists. These results help explain why these
models have brought such large improvements across many
language-understanding tasks.
artificial neural network | self-supervision | syntax | learning
Human language communication is via sequences of words,
canonically produced as a mainly continuous speech stream
(1). Behind this linear organization is a rich hierarchical language
structure with additional links (such as coreference between
mentions) that needs to be understood by a hearer (or reader).
In Fig. 1, for instance, a hearer has to understand a sentence
structure roughly like the one shown to realize that the chef was
out of food rather than the store.* Language understanding, like
vision, can be seen as an inverse problem (3), where the hearer
has to reconstruct structures and causes from the observed
surface form.
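To make this hidden structure concrete, the sketch below encodes a sentence of the kind discussed above as a list of head-word indices and walks from each word up to the root of the sentence. The sentence and the particular dependency analysis are our illustrative assumptions, not the exact annotation of Fig. 1. The paths show why the chef, and not the store, is understood as what was out of food.

    # Illustrative dependency analysis of a sentence like the one in Fig. 1.
    # The word list and head indices below are an assumption for illustration,
    # not the figure's exact annotation.
    words = ["The", "chef", "who", "ran", "to", "the", "store",
             "was", "out", "of", "food"]
    # heads[i] is the 1-indexed position of the syntactic head of word i+1;
    # 0 marks the root of the sentence ("was").
    heads = [2, 8, 4, 2, 4, 7, 5, 0, 8, 9, 10]

    def path_to_root(i):
        """Words on the path from position i (1-indexed) up to the root."""
        path = [words[i - 1]]
        while heads[i - 1] != 0:
            i = heads[i - 1]
            path.append(words[i - 1])
        return path

    # "chef" attaches directly to "was" as its subject, while "store" is
    # buried inside the relative clause "who ran to the store".
    print(path_to_root(2))  # ['chef', 'was']
    print(path_to_root(7))  # ['store', 'to', 'ran', 'chef', 'was']

None of this tree structure is visible in the linear word sequence itself; it is exactly what the hearer must reconstruct.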
In computational linguistics, the long dominant way of
addressing this structure induction problem has been to hand
design linguistic representations, broadly following proposals
from linguistics proper. Under one set of conventions, the sen-
tence in Fig. 1 would be annotated with the structure shown.
Humans then label many natural language sentences with their
underlying structure. Such datasets of annotated human lan-
guage structure, known as treebanks (4, 5), have fueled much
of the research in the field in the last 25 y. Researchers
train progressively better supervised machine-learning mod-
els on the treebank, which attempt to recover this structure
for any sentence (6–8). This approach has been very effec-
tive as an engineering solution, but beyond the high prac-
tical cost of human labeling, it gives no insight into how
children might approach structure induction from observed
data alone.
Recently, enormous progress has been made in natural lan-
guage representation learning by adopting a self-supervised
learning approach. In self-supervised learning, a system is given
no explicit labeling of raw data, but it is able to construct its
own supervised learning problems by choosing to interpret some
of the data as a “label” to be predicted.† The canonical case
for human language is the language-modeling task of trying
to predict the next word in an utterance based on the tempo-
rally preceding words (Fig. 2). Variant tasks include the masked
language-modeling task of predicting a masked word in a text
[a.k.a. the cloze task (11)] and predicting the words likely to
occur around a given word (12, 13). Autoencoders (14) can
also be thought of as self-supervised learning systems. Since no
explicit labeling of the data is required, self-supervised learning
is a type of unsupervised learning, but the approach of self-
generating supervised learning objectives differentiates it from
other unsupervised learning techniques such as clustering.
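As a concrete illustration of the masked-word (cloze) prediction task, the following minimal sketch queries a publicly released pretrained model. It assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint; it is a usage example only, not the training setup analyzed later in this paper.

    # Minimal sketch of masked (cloze-style) word prediction, assuming the
    # Hugging Face `transformers` library and the public bert-base-uncased
    # checkpoint; illustrative only, not the paper's training code.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    # The model must predict the word hidden behind [MASK] from its context.
    for candidate in fill_mask("The chef who ran to the store was out of [MASK]."):
        print(candidate["token_str"], round(candidate["score"], 3))

The "label" here is simply the withheld word itself, so arbitrarily large amounts of raw text can serve as training data.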
One might expect that a machine-learning model trained to
predict the next word in a text will just be a giant associa-
tional learning machine, with lots of statistics on how often the
word restaurant is followed by kitchen and perhaps some basic
abstracted sequence knowledge such as knowing that adjectives
are commonly followed by nouns in English. It is not at all clear
that such a system can develop interesting knowledge of the lin-
guistic structure of whatever human language the system is trained
on. Indeed, this has been the dominant perspective in linguis-
tics, where language models have long been seen as inadequate
and having no scientific interest, even when their usefulness in
practical engineering applications is grudgingly accepted (15, 16).
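The purely associational picture can itself be made concrete. The toy sketch below (our illustration, not a model from this paper) predicts the next word from bigram counts alone, with no representation of hierarchical structure; this is the kind of surface statistics a skeptic might expect a language model to reduce to.

    # Toy "associational learning machine": next-word prediction from bigram
    # counts alone, with no notion of hierarchical structure. Our illustration,
    # not a model from this paper.
    from collections import Counter, defaultdict

    corpus = "the chef ran to the store . the chef was out of food .".split()

    bigram_counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigram_counts[prev][nxt] += 1

    def predict_next(word):
        """Most frequent continuation of `word` in the toy corpus."""
        return bigram_counts[word].most_common(1)[0][0] if bigram_counts[word] else None

    print(predict_next("the"))  # 'chef' -- frequency, not understanding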
Starting in 2018, researchers in natural language process-
ing (NLP) built a new generation of much larger artificial
This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “The Science of Deep Learning,” held March 13–14, 2019, at the National Academy of Sciences in Washington, DC. NAS colloquia began in 1991 and have been published in PNAS since 1995. From February 2001 through May 2019 colloquia were supported by a generous gift from The Dame Jillian and Dr. Arthur M. Sackler Foundation for the Arts, Sciences, & Humanities, in memory of Dame Sackler’s husband, Arthur M. Sackler. The complete program and video recordings of most presentations are available on the NAS website at http://www.nasonline.org/science-of-deep-learning.
Author contributions: C.D.M., K.C., J.H., U.K., and O.L. designed research; K.C., J.H., and U.K. performed research; and C.D.M., K.C., J.H., U.K., and O.L. wrote the paper.
Competing interest statement: K.C. and U.K. have been/are employed part time at Google Inc., and K.C. has a Google PhD Fellowship. Researchers at Google Inc. developed the BERT model analyzed in this paper.
This article is a PNAS Direct Submission. M.G. is a guest editor invited by the Editorial Board.
Published under the PNAS license.
Data deposition: Code and most of the data to reproduce the analyses in this paper are freely available at https://github.com/clarkkev/attention-analysis and https://github.com/john-hewitt/structural-probes.
^1 To whom correspondence may be addressed. Email: manning@cs.stanford.edu.
*There are two main approaches to depicting a sentence’s syntactic structure: phrase
structure (or constituency) and dependency structure (or grammatical relations). The
former is dominant in modern linguistics, but in this paper we use the latter, which is
dominant in computational linguistics. Both representations capture similar, although
generally not identical, information (2).
†The approach of self-supervised learning has existed for decades, used particularly in
robotics, e.g., refs. 9 and 10, but it has recently been revived as a focus of interest, used
also for vision and language.