Gradient-Based Learning Applied
to Document Recognition
YANN LECUN, MEMBER, IEEE, LÉON BOTTOU, YOSHUA BENGIO, AND PATRICK HAFFNER
Invited Paper
Multilayer neural networks trained with the back-propagation
algorithm constitute the best example of a successful gradient-
based learning technique. Given an appropriate network
architecture, gradient-based learning algorithms can be used
to synthesize a complex decision surface that can classify
high-dimensional patterns, such as handwritten characters, with
minimal preprocessing. This paper reviews various methods
applied to handwritten character recognition and compares them
on a standard handwritten digit recognition task. Convolutional
neural networks, which are specifically designed to deal with
the variability of two-dimensional (2-D) shapes, are shown to
outperform all other techniques.
Real-life document recognition systems are composed of multiple
modules including field extraction, segmentation, recognition,
and language modeling. A new learning paradigm, called graph
transformer networks (GTN’s), allows such multimodule systems
to be trained globally using gradient-based methods so as to
minimize an overall performance measure.
Two systems for online handwriting recognition are described.
Experiments demonstrate the advantage of global training, and
the flexibility of graph transformer networks.
A graph transformer network for reading a bank check is
also described. It uses convolutional neural network character
recognizers combined with global training techniques to provide
record accuracy on business and personal checks. It is deployed
commercially and reads several million checks per day.
Keywords— Convolutional neural networks, document recog-
nition, finite state transducers, gradient-based learning, graph
transformer networks, machine learning, neural networks, optical
character recognition (OCR).
NOMENCLATURE
GT Graph transformer.
GTN Graph transformer network.
HMM Hidden Markov model.
HOS Heuristic oversegmentation.
K-NN K-nearest neighbor.
Manuscript received November 1, 1997; revised April 17, 1998.
Y. LeCun, L. Bottou, and P. Haffner are with the Speech and Image
Processing Services Research Laboratory, AT&T Labs-Research, Red
Bank, NJ 07701 USA.
Y. Bengio is with the Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montréal, Québec H3C 3J7 Canada.
Publisher Item Identifier S 0018-9219(98)07863-3.
NN Neural network.
OCR Optical character recognition.
PCA Principal component analysis.
RBF Radial basis function.
RS-SVM Reduced-set support vector method.
SDNN Space displacement neural network.
SVM Support vector method.
TDNN Time delay neural network.
V-SVM Virtual support vector method.
I. INTRODUCTION
Over the last several years, machine learning techniques,
particularly when applied to NN’s, have played an increas-
ingly important role in the design of pattern recognition
systems. In fact, it could be argued that the availability
of learning techniques has been a crucial factor in the
recent success of pattern recognition applications such as
continuous speech recognition and handwriting recognition.
The main message of this paper is that better pattern
recognition systems can be built by relying more on auto-
matic learning and less on hand-designed heuristics. This
is made possible by recent progress in machine learning
and computer technology. Using character recognition as a
case study, we show that hand-crafted feature extraction can
be advantageously replaced by carefully designed learning
machines that operate directly on pixel images. Using
document understanding as a case study, we show that the
traditional way of building recognition systems by manually
integrating individually designed modules can be replaced
by a unified and well-principled design paradigm, called
GTN’s, which allows training all the modules to optimize
a global performance criterion.
Since the early days of pattern recognition it has been
known that the variability and richness of natural data,
be it speech, glyphs, or other types of patterns, make it
almost impossible to build an accurate recognition system
entirely by hand. Consequently, most pattern recognition
systems are built using a combination of automatic learning
techniques and hand-crafted algorithms. The usual method

Fig. 1. Traditional pattern recognition is performed with two
modules: a fixed feature extractor and a trainable classifier.
of recognizing individual patterns consists in dividing the
system into two main modules shown in Fig. 1. The first
module, called the feature extractor, transforms the input
patterns so that they can be represented by low-dimensional
vectors or short strings of symbols that: 1) can be easily
matched or compared and 2) are relatively invariant with
respect to transformations and distortions of the input pat-
terns that do not change their nature. The feature extractor
contains most of the prior knowledge and is rather specific
to the task. It is also the focus of most of the design effort,
because it is often entirely hand crafted. The classifier,
on the other hand, is often general purpose and trainable.
One of the main problems with this approach is that the
recognition accuracy is largely determined by the ability of
the designer to come up with an appropriate set of features.
This turns out to be a daunting task which, unfortunately,
must be redone for each new problem. A large amount of
the pattern recognition literature is devoted to describing
and comparing the relative merits of different feature sets
for particular tasks.
Historically, the need for appropriate feature extractors
was due to the fact that the learning techniques used
by the classifiers were limited to low-dimensional spaces
with easily separable classes [1]. A combination of three
factors has changed this vision over the last decade. First,
the availability of low-cost machines with fast arithmetic
units allows for reliance on more brute-force “numerical”
methods than on algorithmic refinements. Second, the avail-
ability of large databases for problems with a large market
and wide interest, such as handwriting recognition, has
enabled designers to rely more on real data and less on
hand-crafted feature extraction to build recognition systems.
The third and very important factor is the availability
of powerful machine learning techniques that can handle
high-dimensional inputs and can generate intricate decision
functions when fed with these large data sets. It can be
argued that the recent progress in the accuracy of speech
and handwriting recognition systems can be attributed in
large part to an increased reliance on learning techniques
and large training data sets. As evidence of this fact, a large
proportion of modern commercial OCR systems use some
form of multilayer NN trained with back propagation.
In this study, we consider the tasks of handwritten
character recognition (Sections I and II) and compare the
performance of several learning techniques on a benchmark
data set for handwritten digit recognition (Section III).
While more automatic learning is beneficial, no learning
technique can succeed without a minimal amount of prior
knowledge about the task. In the case of multilayer NN’s,
a good way to incorporate knowledge is to tailor the network architecture to the task. Convolutional NN’s [2], introduced in
Section II, are an example of specialized NN architectures
which incorporate knowledge about the invariances of two-
dimensional (2-D) shapes by using local connection patterns
and by imposing constraints on the weights. A comparison
of several methods for isolated handwritten digit recogni-
tion is presented in Section III. To go from the recognition
of individual characters to the recognition of words and
sentences in documents, the idea of combining multiple
modules trained to reduce the overall error is introduced
in Section IV. Recognizing variable-length objects such as
handwritten words using multimodule systems is best done
if the modules manipulate directed graphs. This leads to the
concept of trainable GTN, also introduced in Section IV.
Section V describes the now classical method of HOS for
recognizing words or other character strings. Discriminative
and nondiscriminative gradient-based techniques for train-
ing a recognizer at the word level without requiring manual
segmentation and labeling are presented in Section VI.
Section VII presents the promising space-displacement NN
approach that eliminates the need for segmentation heuris-
tics by scanning a recognizer at all possible locations on
the input. In Section VIII, it is shown that trainable GTN’s
can be formulated as multiple generalized transductions
based on a general graph composition algorithm. The
connections between GTN’s and HMM’s, commonly used
in speech recognition, are also treated. Section IX describes
a globally trained GTN system for recognizing handwriting
entered in a pen computer. This problem is known as
“online” handwriting recognition since the machine must
produce immediate feedback as the user writes. The core
of the system is a convolutional NN. The results clearly
demonstrate the advantages of training a recognizer at
the word level, rather than training it on presegmented,
hand-labeled, isolated characters. Section X describes a
complete GTN-based system for reading handwritten and
machine-printed bank checks. The core of the system is
the convolutional NN called LeNet-5, which is described
in Section II. This system is in commercial use in the
NCR Corporation line of check recognition systems for the
banking industry. It is reading millions of checks per month
in several banks across the United States.
A. Learning from Data
There are several approaches to automatic machine learn-
ing, but one of the most successful approaches, popularized
in recent years by the NN community, can be called “nu-
merical” or gradient-based learning. The learning machine
computes a function $Y^p = F(Z^p, W)$, where $Z^p$ is the $p$th input pattern and $W$ represents the collection of adjustable parameters in the system. In a pattern recognition setting,
the output $Y^p$ may be interpreted as the recognized class label of pattern $Z^p$, or as scores or probabilities associated with each class. A loss function $E^p = D(D^p, F(W, Z^p))$ measures the discrepancy between $D^p$, the “correct” or desired output for pattern $Z^p$, and the output produced by the system. The average loss function $E_{\mathrm{train}}(W)$ is the average of the errors $E^p$ over a set of labeled examples called the training set $\{(Z^1, D^1), \ldots, (Z^P, D^P)\}$. In the simplest setting, the learning problem consists in finding the value of $W$ that minimizes $E_{\mathrm{train}}(W)$. In practice,
the performance of the system on a training set is of little
interest. The more relevant measure is the error rate of the
system in the field, where it would be used in practice.
This performance is estimated by measuring the accuracy
on a set of samples disjoint from the training set, which is
called the test set. Much theoretical and experimental work
[3]–[5] has shown that the gap between the expected error rate on the test set $E_{\mathrm{test}}$ and the error rate on the training set $E_{\mathrm{train}}$ decreases with the number of training samples approximately as
$$E_{\mathrm{test}} - E_{\mathrm{train}} = k\,(h/P)^{\alpha} \qquad (1)$$
where $P$ is the number of training samples, $h$ is a measure of “effective capacity” or complexity of the machine [6], [7], $\alpha$ is a number between 0.5 and 1.0, and $k$ is a constant. This gap always decreases when the number of training samples increases. Furthermore, as the capacity $h$ increases, $E_{\mathrm{train}}$ decreases. Therefore, when increasing the capacity $h$, there is a tradeoff between the decrease of $E_{\mathrm{train}}$ and the increase of the gap, with an optimal value of the capacity $h$ that achieves the lowest generalization error $E_{\mathrm{test}}$. Most learning algorithms attempt to minimize $E_{\mathrm{train}}$ as well as some estimate of the gap. A formal version of this is called
some estimate of the gap. A formal version of this is called
structural risk minimization [6], [7], and it is based on defin-
ing a sequence of learning machines of increasing capacity,
corresponding to a sequence of subsets of the parameter
space such that each subset is a superset of the previous
subset. In practical terms, structural risk minimization is
implemented by minimizing $E_{\mathrm{train}} + \beta H(W)$, where the function $H(W)$ is called a regularization function and $\beta$ is a constant. $H(W)$ is chosen such that it takes large values on parameters $W$ that belong to high-capacity subsets of the parameter space. Minimizing $H(W)$ in effect limits the capacity of the accessible subset of the parameter space, thereby controlling the tradeoff between minimizing the training error and minimizing the expected gap between the training error and test error.
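As an aside not found in the original paper, the regularized objective just described can be sketched in a few lines of NumPy. The squared-error loss and the quadratic regularizer $H(W) = \|W\|^2$ below are illustrative assumptions, not the paper's prescription.

```python
import numpy as np

def empirical_risk(W, Z, D, forward):
    """Average loss E_train(W): mean discrepancy between the desired outputs
    D^p and the machine's outputs F(Z^p, W) over the training set."""
    return np.mean([np.sum((forward(z, W) - d) ** 2) for z, d in zip(Z, D)])

def regularized_risk(W, Z, D, forward, beta=1e-3):
    """Structural-risk-style objective E_train(W) + beta * H(W), with the
    assumed regularizer H(W) = ||W||^2 penalizing high-capacity parameter
    settings."""
    return empirical_risk(W, Z, D, forward) + beta * np.sum(W ** 2)
```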
B. Gradient-Based Learning
The general problem of minimizing a function with
respect to a set of parameters is at the root of many
issues in computer science. Gradient-based learning draws
on the fact that it is generally much easier to minimize
a reasonably smooth, continuous function than a discrete
(combinatorial) function. The loss function can be mini-
mized by estimating the impact of small variations of the
parameter values on the loss function. This is measured
by the gradient of the loss function with respect to the
parameters. Efficient learning algorithms can be devised
when the gradient vector can be computed analytically (as
opposed to numerically through perturbations). This is the
basis of numerous gradient-based learning algorithms with
continuous-valued parameters. In the procedures described
in this article, the set of parameters $W$ is a real-valued vector, with respect to which $E(W)$ is continuous, as well
as differentiable almost everywhere. The simplest mini-
mization procedure in such a setting is the gradient descent
algorithm, where $W$ is iteratively adjusted as follows:
$$W_k = W_{k-1} - \epsilon \frac{\partial E(W)}{\partial W}. \qquad (2)$$
In the simplest case, $\epsilon$ is a scalar constant. More sophisticated procedures use a variable $\epsilon$, replace it with a diagonal matrix, or replace it with an estimate of the
inverse Hessian matrix as in Newton or quasi-Newton
methods. The conjugate gradient method [8] can also be
used. However, Appendix B shows that despite many
claims to the contrary in the literature, the usefulness of
these second-order methods to large learning machines is
very limited.
A popular minimization procedure is the stochastic gra-
dient algorithm, also called the online update. It consists
in updating the parameter vector using a noisy, or approxi-
mated, version of the average gradient. In the most common
instance of it, $W$ is updated on the basis of a single sample:
$$W_k = W_{k-1} - \epsilon \frac{\partial E^{p_k}(W)}{\partial W}. \qquad (3)$$
With this procedure the parameter vector fluctuates around
an average trajectory, but usually it converges considerably
faster than regular gradient descent and second-order meth-
ods on large training sets with redundant samples (such
as those encountered in speech or character recognition).
The reasons for this are explained in Appendix B. The
properties of such algorithms applied to learning have been
studied theoretically since the 1960’s [9]–[11], but practical
successes for nontrivial tasks did not occur until the mid
eighties.
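The two update rules can be contrasted with a minimal NumPy sketch (ours, not the paper's); the linear model and squared-error gradient used in `grad_fn` are assumptions chosen only so the example runs end to end.

```python
import numpy as np

def batch_gradient_descent(W, Z, D, grad_fn, eps=0.1, steps=100):
    """Plain gradient descent, (2): W_k = W_{k-1} - eps * dE(W)/dW, where the
    gradient is averaged over the whole training set before each update."""
    for _ in range(steps):
        g = np.mean([grad_fn(W, z, d) for z, d in zip(Z, D)], axis=0)
        W = W - eps * g
    return W

def stochastic_gradient_descent(W, Z, D, grad_fn, eps=0.1, epochs=10):
    """Online update, (3): the parameters move after every single sample,
    following a noisy estimate of the average gradient."""
    for _ in range(epochs):
        for z, d in zip(Z, D):
            W = W - eps * grad_fn(W, z, d)
    return W

# Assumed example model: y = W @ z with squared error, so the per-sample
# gradient with respect to W is (W @ z - d) outer z.
def grad_fn(W, z, d):
    return np.outer(W @ z - d, z)
```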
C. Gradient Back Propagation
Gradient-based learning procedures have been used since
the late 1950’s, but they were mostly limited to linear
systems [1]. The surprising usefulness of such simple
gradient descent techniques for complex machine learning
tasks was not widely realized until the following three
events occurred. The first event was the realization that,
despite early warnings to the contrary [12], the presence of
local minima in the loss function does not seem to be a
major problem in practice. This became apparent when it
was noticed that local minima did not seem to be a major
impediment to the success of early nonlinear gradient-based
learning techniques such as Boltzmann machines [13], [14].
The second event was the popularization by Rumelhart et
al. [15] and others of a simple and efficient procedure
to compute the gradient in a nonlinear system composed
of several layers of processing, i.e., the back-propagation
algorithm. The third event was the demonstration that the
back-propagation procedure applied to multilayer NN’s
with sigmoidal units can solve complicated learning tasks.
The basic idea of back propagation is that gradients can
be computed efficiently by propagation from the output to
the input. This idea was described in the control theory
literature of the early 1960’s [16], but its application to ma-
chine learning was not generally realized then. Interestingly,
the early derivations of back propagation in the context
of NN learning did not use gradients but “virtual targets”
for units in intermediate layers [17], [18], or minimal
disturbance arguments [19]. The Lagrange formalism used
in the control theory literature provides perhaps the best
rigorous method for deriving back propagation [20] and for
deriving generalizations of back propagation to recurrent
networks [21] and networks of heterogeneous modules [22].
A simple derivation for generic multilayer systems is given
in Section I-E.
The fact that local minima do not seem to be a problem
for multilayer NN’s is somewhat of a theoretical mystery.
It is conjectured that if the network is oversized for the
task (as is usually the case in practice), the presence of
“extra dimensions” in parameter space reduces the risk
of unattainable regions. Back propagation is by far the
most widely used neural-network learning algorithm, and
probably the most widely used learning algorithm of any
form.
D. Learning in Real Handwriting Recognition Systems
Isolated handwritten character recognition has been ex-
tensively studied in the literature (see [23] and [24] for
reviews), and it was one of the early successful applications
of NN’s [25]. Comparative experiments on recognition of
individual handwritten digits are reported in Section III.
They show that NN’s trained with gradient-based learning
perform better than all other methods tested here on the
same data. The best NN’s, called convolutional networks,
are designed to learn to extract relevant features directly
from pixel images (see Section II).
One of the most difficult problems in handwriting recog-
nition, however, is not only to recognize individual charac-
ters, but also to separate out characters from their neighbors
within the word or sentence, a process known as seg-
mentation. The technique for doing this that has become
the “standard” is called HOS. It consists of generating a
large number of potential cuts between characters using
heuristic image processing techniques, and subsequently
selecting the best combination of cuts based on scores
given for each candidate character by the recognizer. In
such a model, the accuracy of the system depends upon the
quality of the cuts generated by the heuristics, and on the
ability of the recognizer to distinguish correctly segmented
characters from pieces of characters, multiple characters,
or otherwise incorrectly segmented characters. Training a
recognizer to perform this task poses a major challenge
because of the difficulty in creating a labeled database
of incorrectly segmented characters. The simplest solution
consists of running the images of character strings through
the segmenter and then manually labeling all the character
hypotheses. Unfortunately, not only is this an extremely
tedious and costly task, it is also difficult to do the labeling
consistently. For example, should the right half of a cut-up
four be labeled as a one or as a noncharacter? Should the
right half of a cut-up eight be labeled as a three?
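As an illustration only (the paper gives no code), the cut-selection step of HOS can be viewed as a best-path search over candidate cut positions. The sketch below assumes a `score(i, j)` callable returning the recognizer's log-score for the image slice between cuts i and j; names and interfaces are hypothetical.

```python
def best_segmentation(cuts, score):
    """Choose the best combination of cuts by dynamic programming.
    `cuts` is a sorted list of candidate cut positions (including both image
    ends); `score(i, j)` is the assumed recognizer log-score for the segment
    between positions i and j. Returns (total score, list of segments)."""
    n = len(cuts)
    best = [float("-inf")] * n
    back = [None] * n
    best[0] = 0.0
    for j in range(1, n):
        for i in range(j):
            s = best[i] + score(cuts[i], cuts[j])
            if s > best[j]:
                best[j], back[j] = s, i
    segs, j = [], n - 1          # recover the chosen segments
    while back[j] is not None:
        segs.append((cuts[back[j]], cuts[j]))
        j = back[j]
    return best[-1], list(reversed(segs))
```

In such a search, it is the recognizer's ability to reject badly cut pieces, discussed above, that keeps the best-scoring path from preferring spurious segmentations.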
The first solution, described in Section V, consists of
training the system at the level of whole strings of char-
acters rather than at the character level. The notion of
gradient-based learning can be used for this purpose. The
system is trained to minimize an overall loss function which
measures the probability of an erroneous answer. Section V
explores various ways to ensure that the loss function
is differentiable and therefore lends itself to the use of
gradient-based learning methods. Section V introduces the
use of directed acyclic graphs whose arcs carry numerical
information as a way to represent the alternative hypotheses
and introduces the idea of GTN.
The second solution, described in Section VII, is to
eliminate segmentation altogether. The idea is to sweep
the recognizer over every possible location on the input
image, and to rely on the “character spotting” property
of the recognizer, i.e., its ability to correctly recognize
a well-centered character in its input field, even in the
presence of other characters beside it, while rejecting
images containing no centered characters [26], [27]. The
sequence of recognizer outputs obtained by sweeping the
recognizer over the input is then fed to a GTN that takes
linguistic constraints into account and finally extracts the
most likely interpretation. This GTN is somewhat similar
to HMM’s, which makes the approach reminiscent of classical speech recognition [28], [29]. While this technique
would be quite expensive in the general case, the use of
convolutional NN’s makes it particularly attractive because
it allows significant savings in computational cost.
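A minimal sketch of this sweeping idea, with assumed names and a naive patch-by-patch loop: in a convolutional implementation most of the computation is shared between neighboring windows, which is precisely the saving mentioned above.

```python
import numpy as np

def sweep_recognizer(image, recognizer, window=28, stride=1):
    """Slide a fixed-size recognizer over every horizontal position of the
    input image and collect its per-class scores; the resulting sequence
    would then be handed to a GTN that applies linguistic constraints.
    `recognizer` is an assumed callable mapping an (H, window) patch to a
    vector of class scores."""
    H, W = image.shape
    outputs = [recognizer(image[:, x:x + window])
               for x in range(0, W - window + 1, stride)]
    return np.stack(outputs)   # shape: (num_positions, num_classes)
```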
E. Globally Trainable Systems
As stated earlier, most practical pattern recognition sys-
tems are composed of multiple modules. For example, a
document recognition system is composed of a field loca-
tor (which extracts regions of interest), a field segmenter
(which cuts the input image into images of candidate
characters), a recognizer (which classifies and scores each
candidate character), and a contextual postprocessor, gen-
erally based on a stochastic grammar (which selects the
best grammatically correct answer from the hypotheses
generated by the recognizer). In most cases, the information
carried from module to module is best represented as
graphs with numerical information attached to the arcs.
For example, the output of the recognizer module can be
represented as an acyclic graph where each arc contains the
label and the score of a candidate character, and where each
path represents an alternative interpretation of the input
string. Typically, each module is manually optimized, or
sometimes trained, outside of its context. For example, the
character recognizer would be trained on labeled images
of presegmented characters. Then the complete system is
assembled, and a subset of the parameters of the modules
is manually adjusted to maximize the overall performance.
This last step is extremely tedious, time consuming, and
almost certainly suboptimal.
A better alternative would be to somehow train the entire
system so as to minimize a global error measure such
as the probability of character misclassifications at the
document level. Ideally, we would want to find a good
minimum of this global loss function with respect to all the
parameters in the system. If the loss function $E$ measuring the performance can be made differentiable with respect to the system’s tunable parameters $W$, we can find a local minimum of $E$ using gradient-based learning. However, at first glance, it appears that the sheer size and complexity of the system would make this intractable.
To ensure that the global loss function $E^p(Z^p, W)$ is
differentiable, the overall system is built as a feedforward
network of differentiable modules. The function imple-
mented by each module must be continuous and differ-
entiable almost everywhere with respect to the internal
parameters of the module (e.g., the weights of an NN
character recognizer in the case of a character recognition
module), and with respect to the module’s inputs. If this is
the case, a simple generalization of the well-known back-
propagation procedure can be used to efficiently compute
the gradients of the loss function with respect to all the
parameters in the system [22]. For example, let us consider
a system built as a cascade of modules, each of which
implements a function $X_n = F_n(W_n, X_{n-1})$, where $X_n$ is a vector representing the output of the module, $W_n$ is the vector of tunable parameters in the module (a subset of $W$), and $X_{n-1}$ is the module’s input vector (as well as the previous module’s output vector). The input $X_0$ to the first module is the input pattern $Z^p$. If the partial derivative of $E^p$ with respect to $X_n$ is known, then the partial derivatives of $E^p$ with respect to $W_n$ and $X_{n-1}$ can be computed using the backward recurrence
$$\frac{\partial E^p}{\partial W_n} = \frac{\partial F}{\partial W}(W_n, X_{n-1}) \frac{\partial E^p}{\partial X_n}$$
$$\frac{\partial E^p}{\partial X_{n-1}} = \frac{\partial F}{\partial X}(W_n, X_{n-1}) \frac{\partial E^p}{\partial X_n} \qquad (4)$$
where $\frac{\partial F}{\partial W}(W_n, X_{n-1})$ is the Jacobian of $F$ with respect to $W$ evaluated at the point $(W_n, X_{n-1})$, and $\frac{\partial F}{\partial X}(W_n, X_{n-1})$ is the Jacobian of $F$ with respect to $X$.
The Jacobian of a vector function is a matrix containing
the partial derivatives of all the outputs with respect to
all the inputs. The first equation computes some terms
of the gradient of $E^p(W)$, while the second equation
generates a backward recurrence, as in the well-known
back-propagation procedure for NN’s. We can average
the gradients over the training patterns to obtain the full
gradient. It is interesting to note that in many instances
there is no need to explicitly compute the Jacobian ma-
trix. The above formula uses the product of the Jacobian
with a vector of partial derivatives, and it is often easier
to compute this product directly without computing the
Jacobian beforehand. In analogy with ordinary multilayer
NN’s, all but the last module are called hidden layers
because their outputs are not observable from the outside.
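To make the recurrence concrete, here is a small NumPy sketch (not from the paper) of a cascade of modules differentiated by vector-Jacobian products, so that no Jacobian is ever formed explicitly. The tanh affine module and the squared-error loss are illustrative assumptions.

```python
import numpy as np

class TanhAffine:
    """One differentiable module X_n = F_n(W_n, X_{n-1}); here an affine map
    followed by tanh, chosen purely as an illustration."""
    def __init__(self, w):
        self.w = w                              # tunable parameters W_n
    def forward(self, x):
        self.x = x                              # remember the input X_{n-1}
        self.y = np.tanh(self.w @ x)
        return self.y
    def backward(self, dE_dy):
        # Vector-Jacobian products implementing recurrence (4) without
        # building the Jacobians.
        dpre = dE_dy * (1.0 - self.y ** 2)
        self.dE_dw = np.outer(dpre, self.x)     # dE^p/dW_n
        return self.w.T @ dpre                  # dE^p/dX_{n-1}

def cascade_gradients(modules, z, target):
    """Forward pass through the cascade, then the backward recurrence;
    afterwards each module holds dE^p/dW_n in its dE_dw attribute."""
    x = z
    for m in modules:
        x = m.forward(x)
    dE_dx = 2.0 * (x - target)                  # assumed squared-error loss
    for m in reversed(modules):
        dE_dx = m.backward(dE_dx)
    return x
```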
In more complex situations than the simple cascade of
modules described above, the partial derivative notation
becomes somewhat ambiguous and awkward. A completely
rigorous derivation in more general cases can be done using
Lagrange functions [20]–[22].
Traditional multilayer NN’s are a special case of the
above where the state information $X_n$ is represented
with fixed-sized vectors, and where the modules are
alternated layers of matrix multiplications (the weights)
and component-wise sigmoid functions (the neurons).
However, as stated earlier, the state information in complex
recognition systems is best represented by graphs with
numerical information attached to the arcs. In this case,
each module, called a GT, takes one or more graphs as input
and produces a graph as output. Networks of such modules
are called GTN’s. Sections IV, VI, and VIII develop the
concept of GTN’s and show that gradient-based learning
can be used to train all the parameters in all the modules
so as to minimize a global loss function. It may seem
paradoxical that gradients can be computed when the state
information is represented by essentially discrete objects
such as graphs, but that difficulty can be circumvented,
as shown later.
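Although the paper defers the details to later sections, the flavor of the graphs exchanged between GT modules can be suggested with a toy sketch: a directed acyclic graph whose arcs carry a character label and a score, from which the best-scoring interpretation (path) is read off. All names here are illustrative, not the paper's API.

```python
from collections import defaultdict

def best_interpretation(arcs, start, end):
    """`arcs` is a list of (from_node, to_node, label, score) tuples forming a
    directed acyclic interpretation graph; each path from `start` to `end` is
    one alternative reading of the input. Nodes are assumed to be integers
    numbered in topological order. Returns (total score, label string)."""
    out = defaultdict(list)
    for u, v, label, score in arcs:
        out[u].append((v, label, score))
    best = {start: (0.0, "")}
    for u in sorted(set(out) | {start}):
        if u not in best:
            continue
        s, text = best[u]
        for v, label, score in out[u]:
            if v not in best or s + score > best[v][0]:
                best[v] = (s + score, text + label)
    return best[end]

# Toy example: two alternative readings of an ambiguous piece of ink.
arcs = [(0, 1, "5", 2.0), (1, 2, "4", 1.5), (0, 2, "H", 0.5)]
print(best_interpretation(arcs, 0, 2))   # -> (3.5, '54')
```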
II. CONVOLUTIONAL NEURAL NETWORKS FOR ISOLATED CHARACTER RECOGNITION
The ability of multilayer networks trained with gradi-
ent descent to learn complex, high-dimensional, nonlinear
mappings from large collections of examples makes them
obvious candidates for image recognition tasks. In the
traditional model of pattern recognition, a hand-designed
feature extractor gathers relevant information from the input
and eliminates irrelevant variabilities. A trainable classifier
then categorizes the resulting feature vectors into classes. In
this scheme, standard, fully connected multilayer networks
can be used as classifiers. A potentially more interesting
scheme is to rely as much as possible on learning in the
feature extractor itself. In the case of character recognition,
a network could be fed with almost raw inputs (e.g.,
size-normalized images). While this can be done with an
ordinary fully connected feedforward network with some
success for tasks such as character recognition, there are
problems.
First, typical images are large, often with several hundred
variables (pixels). A fully connected first layer with, e.g., one hundred hidden units would already
contain several tens of thousands of weights. Such a large
number of parameters increases the capacity of the system
and therefore requires a larger training set. In addition, the
memory requirement to store so many weights may rule out
certain hardware implementations. But the main deficiency
of unstructured nets for image or speech applications is that
they have no built-in invariance with respect to translations
or local distortions of the inputs. Before being sent to
the fixed-size input layer of an NN, character images must be size normalized and centered in the input field.