Preface to the Second Edition
There have been substantial changes in the field of neural networks since the first
edition of this book in 1998. Some of them have been driven by external factors
such as the increase in available data and computing power. The Internet has
made massive amounts of labeled and unlabeled data publicly available. The
ever-increasing raw mass of user-generated and sensed data is made easily
accessible by databases and Web crawlers. Nowadays, anyone with an Internet
connection can parse
the 4,000,000+ articles available on Wikipedia and construct a dataset out of
them. Anyone can capture a Web TV stream and obtain days of video content
to test their learning algorithm.
Another development is that the amount of available computing power has
continued to rise at a steady rate owing to progress in hardware design and
engineering. While the number of cycles per second of processors has plateaued
due to physical limitations, the slowdown has been offset by the emergence of
processing parallelism, best exemplified by massively parallel graphics
processing units (GPUs). Nowadays, anyone can buy a GPU board (one is often
already included in consumer-grade laptops), install free GPU software, and
run computation-intensive simulations at low cost.
These developments have raised the following question: Can we make use of
this large computing power to make sense of these increasingly complex datasets?
Neural networks are a promising approach, as they have the modeling capacity
and flexibility to represent such solutions, and their intrinsically
distributed nature allows one to leverage massively parallel computing
resources.
During the last two decades, the focus of neural network research and the
practice of training neural networks underwent important changes. Learning in
deep architectures (or “deep learning”) has to a certain degree displaced the
once more prevalent regularization issues, or more precisely, changed the
practice of regularizing neural networks. The use of unlabeled data via
unsupervised layer-wise pretraining or deep unsupervised embeddings is now
often preferred over traditional regularization schemes such as weight decay
or restricted connectivity. This new
paradigm has started to spread over a large number of applications such as image
recognition, speech recognition, natural language processing, complex systems,
neuroscience, and computational physics.
The second edition of the book reloads the first edition with more tricks.
These tricks arose from 14 years of theory and experimentation (from 1998
to 2012) by some of the world’s most prominent neural network researchers.
They can make a substantial difference (in terms of speed, ease of
implementation, and accuracy) when it comes to putting algorithms to work on
real problems. Tricks may not necessarily have solid theoretical foundations or
formal validation. As Yoshua Bengio states in Chap. 19, “the wisdom distilled
here should be taken as a guideline, to be tried and challenged, not as a practice
set in stone” [1].