Best Practices for Convolutional Neural Networks
Applied to Visual Document Analysis
Patrice Y. Simard, Dave Steinkraus, John C. Platt
Microsoft Research, One Microsoft Way, Redmond WA 98052
{patrice,v-davste,jplatt}@microsoft.com
Abstract
Neural networks are a powerful technology for
classification of visual inputs arising from documents.
However, there is a confusing plethora of different neural
network methods that are used in the literature and in
industry. This paper describes a set of concrete best
practices that document analysis researchers can use to
get good results with neural networks. The most
important practice is getting a training set as large as
possible: we expand the training set by adding a new
form of distorted data. The next most important practice
is to use convolutional neural networks, which are better
suited for visual document tasks than fully connected networks. We
propose that a simple “do-it-yourself” implementation of
convolution with a flexible architecture is suitable for
many visual document problems. This simple
convolutional neural network does not require complex
methods, such as momentum, weight decay, structure-
dependent learning rates, averaging layers, tangent prop,
or even fine-tuning of the architecture. The end result is a
very simple yet general architecture which can yield
state-of-the-art performance for document analysis. We
illustrate our claims on the MNIST set of English digit
images.
1. Introduction
After being extremely popular in the early 1990s,
neural networks have fallen out of favor in research in the
last 5 years. In 2000, it was even pointed out by the
organizers of the Neural Information Processing Systems
(NIPS) conference that the term “neural networks” in the
submission title was negatively correlated with
acceptance. In contrast, support vector machines (SVMs),
Bayesian networks, and variational methods were
positively correlated.
In this paper, we show that neural networks achieve
the best performance on a handwriting recognition task
(MNIST). MNIST [7] is a benchmark dataset of images
of segmented handwritten digits, each with 28x28 pixels.
There are 60,000 training examples and 10,000 testing
examples.
Our best results on MNIST with neural networks agree
with those of other researchers, who have found
that neural networks continue to yield state-of-the-art
performance on visual document analysis tasks [1][2].
The optimal performance on MNIST was achieved
using two essential practices. First, we created a new,
general set of elastic distortions that vastly expanded the
size of the training set. Second, we used convolutional
neural networks. The elastic distortions are described in
detail in Section 2. Sections 3 and 4 then describe a
generic convolutional neural network architecture that is
simple to implement.
We believe that these two practices are applicable
beyond MNIST, to general visual tasks in document
analysis. Applications range from FAX recognition to
analysis of scanned documents and cursive recognition
(using a visual representation) on the Tablet PC.
2. Expanding Data Sets through Elastic
Distortions
Synthesizing plausible transformations of data is
simple, but the “inverse” problem – transformation
invariance – can be arbitrarily complicated. Fortunately,
learning algorithms are very good at learning inverse
problems. Given a classification task, one may apply
transformations to generate additional data and let the
learning algorithm infer the transformation invariance.
This invariance is embedded in the parameters, so it is in
some sense free, since the computation at recognition
time is unchanged. If the data is scarce and if the
distribution to be learned has transformation-invariance
properties, generating additional data using
transformations may even improve performance [6]. In
the case of handwriting recognition, we postulate that the
distribution has some invariance with respect to not only
affine transformations, but also elastic deformations
corresponding to uncontrolled oscillations of the hand
muscles, dampened by inertia.
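As a concrete illustration of this practice, the following is a minimal Python sketch (not code from this paper; the names expand_training_set and random_shift are ours, and numpy is assumed). It expands a labeled training set with randomly distorted copies of each example, reusing the labels because the transformations are assumed to be label-preserving:

import numpy as np

def expand_training_set(images, labels, distort, copies_per_example=9, seed=0):
    # Return the original examples plus copies_per_example randomly
    # distorted copies of each one; labels are reused unchanged because
    # the distortion is assumed to be label-preserving.
    rng = np.random.default_rng(seed)
    new_images, new_labels = list(images), list(labels)
    for image, label in zip(images, labels):
        for _ in range(copies_per_example):
            new_images.append(distort(image, rng))
            new_labels.append(label)
    return np.stack(new_images), np.array(new_labels)

# Example distortion: a trivial label-preserving transformation that
# shifts each image horizontally by -1, 0, or +1 pixels.
def random_shift(image, rng):
    return np.roll(image, shift=int(rng.integers(-1, 2)), axis=1)

In the setting of this paper, the affine and elastic distortions described below would take the place of random_shift.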
Simple distortions such as translations, rotations, and
skewing can be generated by applying affine
displacement fields to images. This is done by computing,
for every pixel, a new target location with respect to its
original location: the displacement field (∆x(x,y), ∆y(x,y))
gives the offset of the new location of the pixel at position
(x,y) from its previous position. For instance, if ∆x(x,y)=1
and ∆y(x,y)=0, every pixel is shifted by 1 to the right. If
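To make the displacement-field mechanics concrete, the sketch below is a hypothetical Python illustration, not code from this paper; apply_displacement_field is our name, and scipy's map_coordinates is assumed for the bilinear interpolation. It warps an image with given fields ∆x and ∆y, under the convention above that the pixel at (x,y) moves to (x+∆x(x,y), y+∆y(x,y)):

import numpy as np
from scipy.ndimage import map_coordinates

def apply_displacement_field(image, dx, dy):
    # Move the pixel at (x, y) to (x + dx[y, x], y + dy[y, x]).
    # Implemented by backward sampling: each output pixel (x, y) is read
    # from (x - dx, y - dy) in the input, with bilinear interpolation.
    h, w = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return map_coordinates(image, [ys - dy, xs - dx], order=1, mode="constant")

# The example from the text: dx = 1 and dy = 0 everywhere shift the
# image content one pixel to the right.
image = np.zeros((28, 28), dtype=np.float32)
image[10:18, 10:18] = 1.0
dx = np.ones((28, 28))
dy = np.zeros((28, 28))
shifted = apply_displacement_field(image, dx, dy)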