The unreasonable effectiveness of deep learning in artificial intelligence

Terrence J. Sejnowski a,b,1

a Computational Neurobiology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037; and b Division of Biological Sciences, University of California San Diego, La Jolla, CA 92093
Edited by David L. Donoho, Stanford University, Stanford, CA, and approved November 22, 2019 (received for review September 17, 2019)
Deep learning networks have been trained to recognize speech, caption photographs, and translate text between languages at high levels of performance. Although applications of deep learning networks to real-world problems have become ubiquitous, our understanding of why they are so effective is lacking. These empirical results should not be possible according to sample complexity in statistics and nonconvex optimization theory. However, paradoxes in the training and effectiveness of deep learning networks are being investigated, and insights are being found in the geometry of high-dimensional spaces. A mathematical theory of deep learning would illuminate how these networks function, allow us to assess the strengths and weaknesses of different network architectures, and lead to major improvements. Deep learning has provided natural ways for humans to communicate with digital devices and is foundational for building artificial general intelligence. Deep learning was inspired by the architecture of the cerebral cortex, and insights into autonomy and general intelligence may be found in other brain regions that are essential for planning and survival, but major breakthroughs will be needed to achieve these goals.
deep learning | artificial intelligence | neural networks
In 1884, Edwin Abbott wrote Flatland: A Romance of Many Dimensions (1) (Fig. 1). This book was written as a satire on Victorian society, but it has endured because of its exploration of how dimensionality can change our intuitions about space. Flatland was a 2-dimensional (2D) world inhabited by geometrical creatures. The mathematics of 2 dimensions was fully understood by these creatures, with circles being more perfect than triangles. In it, a gentleman square has a dream about a sphere and wakes up to the possibility that his universe might be much larger than he or anyone in Flatland could imagine. He was not able to convince anyone that this was possible, and in the end he was imprisoned.
We can easily imagine adding another spatial dimension when going from a 1-dimensional to a 2D world and from a 2D to a 3-dimensional (3D) world. Lines can intersect themselves in 2 dimensions and sheets can fold back onto themselves in 3 dimensions, but imagining how a 3D object can fold back on itself in a 4-dimensional space is a stretch that was achieved by Charles Howard Hinton in the 19th century (https://en.wikipedia.org/wiki/Charles_Howard_Hinton). What are the properties of spaces having even higher dimensions? What is it like to live in a space with 100 dimensions, or a million dimensions, or a space like our brain that has a million billion dimensions (the number of synapses between neurons)?
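One way to make this failure of intuition concrete is a short numerical sketch (illustrative only, not from the article): in the plane, two random directions are often strongly aligned, but in very high dimensions nearly every pair of random vectors is almost orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim, n_pairs=1000):
    """Average |cos(angle)| between random pairs of dim-dimensional Gaussian vectors."""
    x = rng.standard_normal((n_pairs, dim))
    y = rng.standard_normal((n_pairs, dim))
    cos = np.sum(x * y, axis=1) / (
        np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    )
    return float(np.abs(cos).mean())

# Random directions in the plane are often strongly aligned, but in
# 10,000 dimensions almost every pair is nearly orthogonal.
print(mean_abs_cosine(2))       # ~0.64 (the exact expectation in 2D is 2/pi)
print(mean_abs_cosine(10_000))  # ~0.008
```

This concentration of measure, where typical angles and distances crowd around a single value as dimension grows, is one of the geometric properties of high-dimensional spaces that later sections draw on.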
The first Neural Information Processing Systems (NeurIPS) Conference and Workshop took place at the Denver Tech Center in 1987 (Fig. 2). The 600 attendees were from a wide range of disciplines, including physics, neuroscience, psychology, statistics, electrical engineering, computer science, computer vision, speech recognition, and robotics, but they all had something in common: They all worked on intractably difficult problems that were not easily solved with traditional methods, and they tended to be outliers in their home disciplines. In retrospect, 33 y later, these misfits were pushing the frontiers of their fields into high-dimensional spaces populated by big datasets, the world we are living in today. As the president of the foundation that organizes the annual NeurIPS conferences, I oversaw the remarkable evolution of a community that created modern machine learning. This conference has grown steadily and in 2019 attracted over 14,000 participants. Many intractable problems eventually became tractable, and today machine learning serves as a foundation for contemporary artificial intelligence (AI).
The early goals of machine learning were more modest than those of AI. Rather than aiming directly at general intelligence, machine learning started by attacking practical problems in perception, language, motor control, prediction, and inference, using learning from data as the primary tool. In contrast, early attempts in AI were characterized by handcrafted, low-dimensional algorithms. However, this approach worked only in well-controlled environments. For example, in blocks world all objects were rectangular solids, identically painted, in an environment with fixed lighting. These algorithms did not scale up to vision in the real world, where objects have complex shapes and a wide range of reflectances, and lighting conditions are uncontrolled. The real world is high-dimensional, and there may not be any low-dimensional model that can be fit to it (2). Similar problems were encountered with early models of natural languages based on symbols and syntax, which ignored the complexities of semantics (3). Practical natural language applications became possible once the complexity of deep learning language models approached the complexity of the real world. Models of natural language with millions of parameters, trained on millions of labeled examples, are now used routinely. Even larger deep learning language networks are in production today, providing services to millions of users online, less than a decade since they were introduced.
Origins of Deep Learning
I have written a book, The Deep Learning Revolution: Artificial Intelligence Meets Human Intelligence (4), which tells the story of how deep learning came about. Deep learning was inspired by the massively parallel architecture found in brains, and its origins can be traced to Frank Rosenblatt's perceptron (5) in the 1950s, which was based on a simplified model of a single neuron introduced by McCulloch and Pitts (6). The perceptron performed pattern recognition and learned to classify labeled examples (Fig. 3). Rosenblatt proved a theorem that if there was a set of parameters that could classify new inputs correctly, and there were
This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, "The Science of Deep Learning," held March 13–14, 2019, at the National Academy of Sciences in Washington, DC. NAS colloquia began in 1991 and have been published in PNAS since 1995. From February 2001 through May 2019 colloquia were supported by a generous gift from The Dame Jillian and Dr. Arthur M. Sackler Foundation for the Arts, Sciences, & Humanities, in memory of Dame Sackler's husband, Arthur M. Sackler. The complete program and video recordings of most presentations are available on the NAS website at www.nasonline.org/science-of-deep-learning.
Author contributions: T.J.S. wrote the paper.
The author declares no competing interest.
This article is a PNAS Direct Submission.
Published under the PNAS license.
1 Email: terry@salk.edu.
www.pnas.org/cgi/doi/10.1073/pnas.1907373117