
Gradient-Based Learning Applied to Document Recognition

YANN LECUN, MEMBER, IEEE, LÉON BOTTOU, YOSHUA BENGIO, AND PATRICK HAFFNER

Invited Paper

Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient-based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of two-dimensional (2-D) shapes, are shown to outperform all other techniques.

Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN's), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure.

Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training and the flexibility of graph transformer networks.

A graph transformer network for reading a bank check is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal checks. It is deployed commercially and reads several million checks per day.

Keywords—Convolutional neural networks, document recognition, finite state transducers, gradient-based learning, graph transformer networks, machine learning, neural networks, optical character recognition (OCR).

NOMENCLATURE

GT Graph transformer.
GTN Graph transformer network.
HMM Hidden Markov model.
HOS Heuristic oversegmentation.
K-NN K-nearest neighbor.
NN Neural network.
OCR Optical character recognition.
PCA Principal component analysis.
RBF Radial basis function.
RS-SVM Reduced-set support vector method.
SDNN Space displacement neural network.
SVM Support vector method.
TDNN Time delay neural network.
V-SVM Virtual support vector method.

Manuscript received November 1, 1997; revised April 17, 1998.
Y. LeCun, L. Bottou, and P. Haffner are with the Speech and Image Processing Services Research Laboratory, AT&T Labs-Research, Red Bank, NJ 07701 USA.
Y. Bengio is with the Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montréal, Québec H3C 3J7 Canada.
Publisher Item Identifier S 0018-9219(98)07863-3.

I. INTRODUCTION

Over the last several years, machine learning techniques, particularly when applied to NN's, have played an increasingly important role in the design of pattern recognition systems. In fact, it could be argued that the availability of learning techniques has been a crucial factor in the recent success of pattern recognition applications such as continuous speech recognition and handwriting recognition.

The main message of this paper is that better pattern recognition systems can be built by relying more on automatic learning and less on hand-designed heuristics. This is made possible by recent progress in machine learning and computer technology. Using character recognition as a case study, we show that hand-crafted feature extraction can be advantageously replaced by carefully designed learning machines that operate directly on pixel images. Using document understanding as a case study, we show that the traditional way of building recognition systems by manually integrating individually designed modules can be replaced by a unified and well-principled design paradigm, called GTN's, which allows training all the modules to optimize a global performance criterion.

Since the early days of pattern recognition it has been known that the variability and richness of natural data, be it speech, glyphs, or other types of patterns, make it almost impossible to build an accurate recognition system entirely by hand. Consequently, most pattern recognition systems are built using a combination of automatic learning techniques and hand-crafted algorithms.

Fig. 1. Traditional pattern recognition is performed with two modules: a fixed feature extractor and a trainable classifier.

The usual method of recognizing individual patterns consists in dividing the system into the two main modules shown in Fig. 1. The first module, called the feature extractor, transforms the input patterns so that they can be represented by low-dimensional vectors or short strings of symbols that 1) can be easily matched or compared and 2) are relatively invariant with respect to transformations and distortions of the input patterns that do not change their nature. The feature extractor contains most of the prior knowledge and is rather specific to the task. It is also the focus of most of the design effort, because it is often entirely hand crafted. The classifier, on the other hand, is often general purpose and trainable. One of the main problems with this approach is that the recognition accuracy is largely determined by the ability of the designer to come up with an appropriate set of features. This turns out to be a daunting task which, unfortunately, must be redone for each new problem. A large amount of the pattern recognition literature is devoted to describing and comparing the relative merits of different feature sets for particular tasks.

Historically, the need for appropriate feature extractors was due to the fact that the learning techniques used by the classifiers were limited to low-dimensional spaces with easily separable classes [1]. A combination of three factors has changed this vision over the last decade. First, the availability of low-cost machines with fast arithmetic units allows for reliance on more brute-force “numerical” methods than on algorithmic refinements. Second, the availability of large databases for problems with a large market and wide interest, such as handwriting recognition, has enabled designers to rely more on real data and less on hand-crafted feature extraction to build recognition systems. The third and very important factor is the availability of powerful machine learning techniques that can handle high-dimensional inputs and can generate intricate decision functions when fed with these large data sets. It can be argued that the recent progress in the accuracy of speech and handwriting recognition systems can be attributed in large part to an increased reliance on learning techniques and large training data sets. As evidence of this fact, a large proportion of modern commercial OCR systems use some form of multilayer NN trained with back propagation.

In this study, we consider the tasks of handwritten character recognition (Sections I and II) and compare the performance of several learning techniques on a benchmark data set for handwritten digit recognition (Section III). While more automatic learning is beneficial, no learning technique can succeed without a minimal amount of prior knowledge about the task. In the case of multilayer NN's, a good way to incorporate knowledge is to tailor the network's architecture to the task. Convolutional NN's [2], introduced in Section II, are an example of specialized NN architectures which incorporate knowledge about the invariances of two-dimensional (2-D) shapes by using local connection patterns and by imposing constraints on the weights. A comparison of several methods for isolated handwritten digit recognition is presented in Section III.

To go from the recognition of individual characters to the recognition of words and sentences in documents, the idea of combining multiple modules trained to reduce the overall error is introduced in Section IV. Recognizing variable-length objects such as handwritten words using multimodule systems is best done if the modules manipulate directed graphs. This leads to the concept of a trainable GTN, also introduced in Section IV. Section V describes the now classical method of HOS for recognizing words or other character strings. Discriminative and nondiscriminative gradient-based techniques for training a recognizer at the word level without requiring manual segmentation and labeling are presented in Section VI. Section VII presents the promising space-displacement NN approach, which eliminates the need for segmentation heuristics by scanning a recognizer at all possible locations on the input. In Section VIII, it is shown that trainable GTN's can be formulated as multiple generalized transductions based on a general graph composition algorithm. The connections between GTN's and HMM's, commonly used in speech recognition, are also treated.

Section IX describes a globally trained GTN system for recognizing handwriting entered in a pen computer. This problem is known as “online” handwriting recognition since the machine must produce immediate feedback as the user writes. The core of the system is a convolutional NN. The results clearly demonstrate the advantages of training a recognizer at the word level, rather than training it on presegmented, hand-labeled, isolated characters. Section X describes a complete GTN-based system for reading handwritten and machine-printed bank checks. The core of the system is the convolutional NN called LeNet-5, which is described in Section II. This system is in commercial use in the NCR Corporation line of check recognition systems for the banking industry. It reads millions of checks per month in several banks across the United States.

A. Learning from Data

There are several approaches to automatic machine learning, but one of the most successful approaches, popularized in recent years by the NN community, can be called “numerical” or gradient-based learning. The learning machine computes a function $Y^p = F(Z^p, W)$, where $Z^p$ is the $p$th input pattern and $W$ represents the collection of adjustable parameters in the system. In a pattern recognition setting, the output $Y^p$ may be interpreted as the recognized class label of pattern $Z^p$, or as scores or probabilities associated with each class. A loss function $E^p = \mathcal{D}(D^p, F(W, Z^p))$ measures the discrepancy between $D^p$, the “correct” or desired output for pattern $Z^p$, and the output produced by the system. The average loss function $E_{\text{train}}(W)$ is the average of the errors $E^p$ over a set of labeled examples called the training set $\{(Z^1, D^1), \ldots, (Z^P, D^P)\}$. In the simplest setting, the learning problem consists in finding the value of $W$ that minimizes $E_{\text{train}}(W)$.

In practice, the performance of the system on a training set is of little interest. The more relevant measure is the error rate of the system in the field, where it would be used in practice. This performance is estimated by measuring the accuracy on a set of samples disjoint from the training set, which is called the test set. Much theoretical and experimental work [3]–[5] has shown that the gap between the expected error rate on the test set, $E_{\text{test}}$, and the error rate on the training set, $E_{\text{train}}$, decreases with the number of training samples approximately as

$$E_{\text{test}} - E_{\text{train}} = k (h/P)^{\alpha} \tag{1}$$

where $P$ is the number of training samples, $h$ is a measure of “effective capacity” or complexity of the machine [6], [7], $\alpha$ is a number between 0.5 and 1.0, and $k$ is a constant.

This gap always decreases when the number of training samples increases. Furthermore, as the capacity $h$ increases, $E_{\text{train}}$ decreases. Therefore, when increasing the capacity $h$, there is a tradeoff between the decrease of $E_{\text{train}}$ and the increase of the gap, with an optimal value of the capacity $h$ that achieves the lowest generalization error $E_{\text{test}}$. Most learning algorithms attempt to minimize $E_{\text{train}}$ as well as some estimate of the gap. A formal version of this is called structural risk minimization [6], [7], and it is based on defining a sequence of learning machines of increasing capacity, corresponding to a sequence of subsets of the parameter space such that each subset is a superset of the previous subset. In practical terms, structural risk minimization is implemented by minimizing $E_{\text{train}} + \beta H(W)$, where the function $H(W)$ is called a regularization function and $\beta$ is a constant. $H(W)$ is chosen such that it takes large values on parameters $W$ that belong to high-capacity subsets of the parameter space. Minimizing $H(W)$ in effect limits the capacity of the accessible subset of the parameter space, thereby controlling the tradeoff between minimizing the training error and minimizing the expected gap between the training error and the test error.
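To make the criterion concrete, here is a minimal Python sketch of $E_{\text{train}}(W) + \beta H(W)$. The paper leaves $H$ unspecified; weight decay $H(W) = \|W\|^2$ is one common choice assumed here, and all function names are ours, not the paper's:

```python
import numpy as np

# Sketch of the structural-risk objective E_train(W) + beta * H(W).
# F is the learning machine and loss plays the role of D(D^p, F(W, Z^p));
# H(W) = ||W||^2 (weight decay) is an assumed regularizer.
def average_loss(W, patterns, targets, F, loss):
    return np.mean([loss(d, F(W, z)) for z, d in zip(patterns, targets)])

def regularized_loss(W, patterns, targets, F, loss, beta=1e-4):
    return average_loss(W, patterns, targets, F, loss) + beta * np.sum(W ** 2)
```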

B. Gradient-Based Learning

The general problem of minimizing a function with respect to a set of parameters is at the root of many issues in computer science. Gradient-based learning draws on the fact that it is generally much easier to minimize a reasonably smooth, continuous function than a discrete (combinatorial) function. The loss function can be minimized by estimating the impact of small variations of the parameter values on the loss function. This is measured by the gradient of the loss function with respect to the parameters. Efficient learning algorithms can be devised when the gradient vector can be computed analytically (as opposed to numerically through perturbations). This is the basis of numerous gradient-based learning algorithms with continuous-valued parameters. In the procedures described in this article, the set of parameters $W$ is a real-valued vector, with respect to which $E(W)$ is continuous, as well as differentiable almost everywhere. The simplest minimization procedure in such a setting is the gradient descent algorithm, where $W$ is iteratively adjusted as follows:

$$W_k = W_{k-1} - \epsilon \, \frac{\partial E(W)}{\partial W}. \tag{2}$$

In the simplest case, $\epsilon$ is a scalar constant. More sophisticated procedures use a variable $\epsilon$, replace it with a diagonal matrix, or replace it with an estimate of the inverse Hessian matrix, as in Newton and quasi-Newton methods. The conjugate gradient method [8] can also be used. However, Appendix B shows that, despite many claims to the contrary in the literature, the usefulness of these second-order methods for large learning machines is very limited.
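As an illustration of the update in (2), and of why analytic gradients matter, the following sketch uses an arbitrary quadratic toy loss (all names are ours); the perturbation-based numerical alternative shown alongside costs two loss evaluations per parameter:

```python
import numpy as np

def E(W):                      # toy loss, chosen only for illustration
    return 0.5 * np.sum(W ** 2)

def grad_E(W):                 # its analytic gradient
    return W

def gradient_descent(W, epsilon=0.1, steps=100):
    for _ in range(steps):
        W = W - epsilon * grad_E(W)   # Eq. (2), scalar constant epsilon
    return W

def numerical_grad(W, eps=1e-6):
    g = np.zeros_like(W)
    for i in range(W.size):
        d = np.zeros_like(W); d[i] = eps
        g[i] = (E(W + d) - E(W - d)) / (2 * eps)  # two evaluations per parameter
    return g
```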

A popular minimization procedure is the stochastic gradient algorithm, also called the online update. It consists in updating the parameter vector using a noisy, or approximated, version of the average gradient. In the most common instance, $W$ is updated on the basis of a single sample:

$$W_k = W_{k-1} - \epsilon \, \frac{\partial E^{p_k}(W)}{\partial W}. \tag{3}$$

With this procedure the parameter vector fluctuates around an average trajectory, but it usually converges considerably faster than regular gradient descent and second-order methods on large training sets with redundant samples (such as those encountered in speech or character recognition). The reasons for this are explained in Appendix B. The properties of such algorithms applied to learning have been studied theoretically since the 1960's [9]–[11], but practical successes for nontrivial tasks did not occur until the mid-1980s.
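A corresponding sketch of the single-sample update in (3); the training set and the per-sample gradient function are assumed stand-ins:

```python
import random

# Eq. (3): adjust W from the gradient of the loss on one sample p_k at a time.
def stochastic_gradient(W, samples, grad_on_sample, epsilon=0.01, epochs=10):
    for _ in range(epochs):
        random.shuffle(samples)                     # visit samples in random order
        for p in samples:
            W = W - epsilon * grad_on_sample(W, p)  # noisy but frequent updates
    return W
```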

C. Gradient Back Propagation

Gradient-based learning procedures have been used since the late 1950's, but they were mostly limited to linear systems [1]. The surprising usefulness of such simple gradient descent techniques for complex machine learning tasks was not widely realized until the following three events occurred. The first event was the realization that, despite early warnings to the contrary [12], the presence of local minima in the loss function does not seem to be a major problem in practice. This became apparent when it was noticed that local minima did not seem to be a major impediment to the success of early nonlinear gradient-based learning techniques such as Boltzmann machines [13], [14]. The second event was the popularization by Rumelhart et al. [15] and others of a simple and efficient procedure to compute the gradient in a nonlinear system composed of several layers of processing, i.e., the back-propagation algorithm. The third event was the demonstration that the back-propagation procedure applied to multilayer NN's with sigmoidal units can solve complicated learning tasks.

The basic idea of back propagation is that gradients can be computed efficiently by propagation from the output to the input. This idea was described in the control theory literature of the early 1960's [16], but its application to machine learning was not generally realized then. Interestingly, the early derivations of back propagation in the context of NN learning did not use gradients but “virtual targets” for units in intermediate layers [17], [18], or minimal disturbance arguments [19]. The Lagrange formalism used in the control theory literature provides perhaps the best rigorous method for deriving back propagation [20] and for deriving generalizations of back propagation to recurrent networks [21] and networks of heterogeneous modules [22]. A simple derivation for generic multilayer systems is given in Section I-E.

The fact that local minima do not seem to be a problem for multilayer NN's is somewhat of a theoretical mystery. It is conjectured that if the network is oversized for the task (as is usually the case in practice), the presence of “extra dimensions” in parameter space reduces the risk of unattainable regions. Back propagation is by far the most widely used neural-network learning algorithm, and probably the most widely used learning algorithm of any form.

D. Learning in Real Handwriting Recognition Systems

Isolated handwritten character recognition has been extensively studied in the literature (see [23] and [24] for reviews), and it was one of the early successful applications of NN's [25]. Comparative experiments on recognition of individual handwritten digits are reported in Section III. They show that NN's trained with gradient-based learning perform better than all other methods tested here on the same data. The best NN's, called convolutional networks, are designed to learn to extract relevant features directly from pixel images (see Section II).

One of the most difficult problems in handwriting recognition, however, is not only to recognize individual characters but also to separate characters from their neighbors within the word or sentence, a process known as segmentation. The technique that has become the “standard” for doing this is called HOS. It consists of generating a large number of potential cuts between characters using heuristic image-processing techniques, and subsequently selecting the best combination of cuts based on scores given for each candidate character by the recognizer. In such a model, the accuracy of the system depends upon the quality of the cuts generated by the heuristics, and on the ability of the recognizer to distinguish correctly segmented characters from pieces of characters, multiple characters, or otherwise incorrectly segmented characters.

Training a recognizer to perform this task poses a major challenge because of the difficulty of creating a labeled database of incorrectly segmented characters. The simplest solution consists of running the images of character strings through the segmenter and then manually labeling all the character hypotheses. Unfortunately, not only is this an extremely tedious and costly task, it is also difficult to do the labeling consistently. For example, should the right half of a cut-up four be labeled as a one or as a noncharacter? Should the right half of a cut-up eight be labeled as a three?
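The cut-selection step of HOS can be pictured as a best-path problem over the candidate cuts. The sketch below is our illustration, not the paper's implementation; it assumes a `score(i, j)` function returning the recognizer's log-score for the image slice between cuts i and j, and finds the best-scoring segmentation by dynamic programming:

```python
def best_segmentation(num_cuts, score, max_span=4):
    # best[j] is the best total score of any segmentation ending at cut j
    best = [float("-inf")] * (num_cuts + 1)
    back = [0] * (num_cuts + 1)
    best[0] = 0.0
    for j in range(1, num_cuts + 1):
        for i in range(max(0, j - max_span), j):
            s = best[i] + score(i, j)     # extend by the candidate segment (i, j)
            if s > best[j]:
                best[j], back[j] = s, i
    segments, j = [], num_cuts
    while j > 0:                          # trace back the chosen cut sequence
        segments.append((back[j], j))
        j = back[j]
    return best[num_cuts], segments[::-1]
```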

The first solution, described in Section V, consists of training the system at the level of whole strings of characters rather than at the character level. The notion of gradient-based learning can be used for this purpose. The system is trained to minimize an overall loss function which measures the probability of an erroneous answer. Section V explores various ways to ensure that the loss function is differentiable and therefore lends itself to the use of gradient-based learning methods. Section V introduces the use of directed acyclic graphs whose arcs carry numerical information as a way to represent the alternative hypotheses, and introduces the idea of GTN's.

The second solution, described in Section VII, is to eliminate segmentation altogether. The idea is to sweep the recognizer over every possible location on the input image and to rely on the “character spotting” property of the recognizer, i.e., its ability to correctly recognize a well-centered character in its input field, even in the presence of other characters beside it, while rejecting images containing no centered characters [26], [27]. The sequence of recognizer outputs obtained by sweeping the recognizer over the input is then fed to a GTN that takes linguistic constraints into account and finally extracts the most likely interpretation. This GTN is somewhat similar to HMM's, which makes the approach reminiscent of classical speech recognition [28], [29]. While this technique would be quite expensive in the general case, the use of convolutional NN's makes it particularly attractive because it allows significant savings in computational cost.
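A sketch of the sweep itself, assuming a recognizer callable that scores one fixed-size window and an input given as a 2-D array; the window and step sizes here are arbitrary:

```python
def sweep(recognizer, image, window=32, step=4):
    # Apply the recognizer at every horizontal location; the resulting
    # sequence of outputs is what would be fed to the GTN.
    height, width = image.shape
    return [recognizer(image[:, x:x + window])
            for x in range(0, width - window + 1, step)]
```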

E. Globally Trainable Systems

As stated earlier, most practical pattern recognition systems are composed of multiple modules. For example, a document recognition system is composed of a field locator (which extracts regions of interest), a field segmenter (which cuts the input image into images of candidate characters), a recognizer (which classifies and scores each candidate character), and a contextual postprocessor, generally based on a stochastic grammar (which selects the best grammatically correct answer from the hypotheses generated by the recognizer). In most cases, the information carried from module to module is best represented as graphs with numerical information attached to the arcs. For example, the output of the recognizer module can be represented as an acyclic graph where each arc contains the label and the score of a candidate character, and where each path represents an alternative interpretation of the input string. Typically, each module is manually optimized, or sometimes trained, outside of its context. For example, the character recognizer would be trained on labeled images of presegmented characters. Then the complete system is assembled, and a subset of the parameters of the modules is manually adjusted to maximize the overall performance. This last step is extremely tedious, time consuming, and almost certainly suboptimal.
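For concreteness, one possible encoding of such a recognizer-output graph (the tuple layout is our assumption): nodes are cut positions, each arc carries a candidate label and its score, and every source-to-sink path spells one interpretation of the input string:

```python
# Arcs of an acyclic interpretation graph: (from_node, to_node, label, log_score).
arcs = [
    (0, 1, "5", -0.2),    # "5" spans the slice between cuts 0 and 1
    (0, 2, "S", -1.9),    # a competing hypothesis covering both slices
    (1, 2, "3", -0.4),
]
# The path 0 -> 1 -> 2 reads "53" with total log-score -0.6.
```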

A better alternative would be to somehow train the entire system so as to minimize a global error measure such as the probability of character misclassifications at the document level. Ideally, we would want to find a good minimum of this global loss function with respect to all the parameters in the system. If the loss function $E$ measuring the performance can be made differentiable with respect to the system's tunable parameters $W$, we can find a local minimum of $E$ using gradient-based learning. However, at first glance, it appears that the sheer size and complexity of the system would make this intractable.

To ensure that the global loss function $E^p(Z^p, W)$ is differentiable, the overall system is built as a feedforward network of differentiable modules. The function implemented by each module must be continuous and differentiable almost everywhere with respect to the internal parameters of the module (e.g., the weights of an NN character recognizer in the case of a character recognition module) and with respect to the module's inputs. If this is the case, a simple generalization of the well-known back-propagation procedure can be used to efficiently compute the gradients of the loss function with respect to all the parameters in the system [22]. For example, let us consider a system built as a cascade of modules, each of which implements a function $X_n = F_n(W_n, X_{n-1})$, where $X_n$ is a vector representing the output of the module, $W_n$ is the vector of tunable parameters in the module (a subset of $W$), and $X_{n-1}$ is the module's input vector (as well as the previous module's output vector). The input $X_0$ to the first module is the input pattern $Z^p$. If the partial derivative of $E^p$ with respect to $X_n$ is known, then the partial derivatives of $E^p$ with respect to $W_n$ and $X_{n-1}$ can be computed using the backward recurrence

$$\frac{\partial E^p}{\partial W_n} = \frac{\partial F}{\partial W}(W_n, X_{n-1}) \, \frac{\partial E^p}{\partial X_n}$$
$$\frac{\partial E^p}{\partial X_{n-1}} = \frac{\partial F}{\partial X}(W_n, X_{n-1}) \, \frac{\partial E^p}{\partial X_n} \tag{4}$$

where $\frac{\partial F}{\partial W}(W_n, X_{n-1})$ is the Jacobian of $F$ with respect to $W$ evaluated at the point $(W_n, X_{n-1})$, and $\frac{\partial F}{\partial X}(W_n, X_{n-1})$ is the Jacobian of $F$ with respect to $X$. The Jacobian of a vector function is a matrix containing the partial derivatives of all the outputs with respect to all the inputs. The first equation computes some terms of the gradient of $E^p(W)$, while the second equation generates a backward recurrence, as in the well-known back-propagation procedure for NN's. We can average the gradients over the training patterns to obtain the full gradient. It is interesting to note that in many instances there is no need to explicitly compute the Jacobian matrix. The above formula uses the product of the Jacobian with a vector of partial derivatives, and it is often easier to compute this product directly without computing the Jacobian beforehand. In analogy with ordinary multilayer NN's, all but the last module are called hidden layers because their outputs are not observable from the outside.

In more complex situations than the simple cascade of modules described above, the partial derivative notation becomes somewhat ambiguous and awkward. A completely rigorous derivation in more general cases can be done using Lagrange functions [20]–[22].
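A minimal sketch of this module-wise back propagation, assuming a cascade of two illustrative module types (a linear map holding the weights and a component-wise sigmoid); each `backward()` forms the Jacobian-vector products of (4) directly, without ever building the Jacobian matrix:

```python
import numpy as np

class Linear:
    def __init__(self, W):
        self.W = W
    def forward(self, x):
        self.x = x                          # keep X_{n-1} for the backward pass
        return self.W @ x
    def backward(self, dE_dy):
        self.dW = np.outer(dE_dy, self.x)   # dE/dW_n, first line of Eq. (4)
        return self.W.T @ dE_dy             # dE/dX_{n-1}, second line of Eq. (4)

class Sigmoid:
    def forward(self, x):
        self.y = 1.0 / (1.0 + np.exp(-x))
        return self.y
    def backward(self, dE_dy):
        return dE_dy * self.y * (1.0 - self.y)

def backprop(modules, x0, dE_dxN):
    x = x0
    for m in modules:                       # forward cascade X_n = F_n(W_n, X_{n-1})
        x = m.forward(x)
    g = dE_dxN
    for m in reversed(modules):             # backward recurrence of Eq. (4)
        g = m.backward(g)
    return g                                # gradient w.r.t. the input pattern
```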

Traditional multilayer NN's are a special case of the above where the state information $X_n$ is represented with fixed-sized vectors, and where the modules are alternating layers of matrix multiplications (the weights) and component-wise sigmoid functions (the neurons). However, as stated earlier, the state information in complex recognition systems is best represented by graphs with numerical information attached to the arcs. In this case, each module, called a GT, takes one or more graphs as input and produces a graph as output. Networks of such modules are called GTN's. Sections IV, VI, and VIII develop the concept of GTN's and show that gradient-based learning can be used to train all the parameters in all the modules so as to minimize a global loss function. It may seem paradoxical that gradients can be computed when the state information is represented by essentially discrete objects such as graphs, but that difficulty can be circumvented, as shown later.

II. CONVOLUTIONAL NEURAL NETWORKS FOR ISOLATED CHARACTER RECOGNITION

The ability of multilayer networks trained with gradient descent to learn complex, high-dimensional, nonlinear mappings from large collections of examples makes them obvious candidates for image recognition tasks. In the traditional model of pattern recognition, a hand-designed feature extractor gathers relevant information from the input and eliminates irrelevant variabilities. A trainable classifier then categorizes the resulting feature vectors into classes. In this scheme, standard, fully connected multilayer networks can be used as classifiers. A potentially more interesting scheme is to rely as much as possible on learning in the feature extractor itself. In the case of character recognition, a network could be fed with almost raw inputs (e.g., size-normalized images). While this can be done with an ordinary fully connected feedforward network with some success for tasks such as character recognition, there are problems.

First, typical images are large, often with several hundred variables (pixels). A fully connected first layer with, say, one hundred hidden units would already contain several tens of thousands of weights. Such a large number of parameters increases the capacity of the system and therefore requires a larger training set. In addition, the memory requirement to store so many weights may rule out certain hardware implementations.
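The weight count is easy to verify; the image size below is an assumption (the text says only “several hundred” pixels):

```python
pixels = 28 * 28                  # assumed size-normalized input image
hidden_units = 100
weights = pixels * hidden_units + hidden_units   # connections plus biases
print(weights)                    # 78500: several tens of thousands of parameters
```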

But the main deficiency of unstructured nets for image or speech applications is that they have no built-in invariance with respect to translations or local distortions of the inputs. Before being sent to the fixed-size input layer of an NN, character images must be approximately size normalized and centered in the input field.
