Deep Learning Applied to Document Recognition

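The paper "Gradient-Based Learning Applied to Document Recognition" examines in depth how gradient-based learning applies to document recognition, with particular emphasis on the contribution of LeNet-5. LeNet-5 is a milestone among multilayer neural networks. It is trained with back-propagation, a successful example of a gradient-based learning technique. The paper argues that, given an appropriate network architecture, gradient-based learning algorithms can build complex decision surfaces that classify high-dimensional patterns such as handwritten characters with almost no preprocessing.

Document recognition involves several steps and modules, including field extraction, image preprocessing, feature extraction, and classification. In the comparison of handwritten character recognition methods, convolutional neural networks (CNNs) stand out. CNNs are designed specifically to cope with the variability of shapes, and their structure lets them learn local features, which makes them more effective at recognizing handwritten digits and other complex patterns. The CNN described in the paper has three key kinds of component: convolutional layers, pooling layers, and fully connected layers. A convolutional layer scans the input image with learnable filters to detect specific features. A pooling layer reduces the spatial dimensions of the data, which lowers the computational cost while keeping the key information. A fully connected layer maps the features extracted by the earlier layers to the final classification output.

LeNet-5 succeeds at handwritten character recognition because of its hierarchical structure, in which each layer learns features at a different level. Lower layers may learn edges and simple shapes, while higher layers learn more complex features such as combinations of strokes. This layered learning lets the model build up its recognition of a handwritten character step by step. Besides CNNs, the paper may also cover other traditional methods, such as template matching, support vector machines (SVMs), or other classical machine learning algorithms, and compare their performance with that of CNNs. It stresses the advantage of CNNs on image recognition tasks, especially on datasets with intrinsic variation and deformation.

Practical document recognition systems usually also include OCR (optical character recognition) technology together with error detection and correction mechanisms that improve the accuracy and robustness of the overall system. These components work together to ensure that document content is converted accurately into machine-readable form. In short, "Gradient-Based Learning Applied to Document Recognition" gives a detailed account of gradient-based learning for handwritten character recognition, in particular the principles and advantages of the convolutional network LeNet-5, and is valuable for understanding how deep learning applies to document processing.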

PROC. OF THE IEEE, NOVEMBER 1998 8

   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0  X  .  .  .  X  X  X  .  .  X  X  X  X  .  X  X
1  X  X  .  .  .  X  X  X  .  .  X  X  X  X  .  X
2  X  X  X  .  .  .  X  X  X  .  .  X  .  X  X  X
3  .  X  X  X  .  .  X  X  X  X  .  .  X  .  X  X
4  .  .  X  X  X  .  .  X  X  X  X  .  X  X  .  X
5  .  .  .  X  X  X  .  .  X  X  X  X  .  X  X  X

TABLE I

Each column indicates which feature maps in S2 are combined by the units in a particular feature map of C3.

combined by each C3 feature map. Why not connect every S2 feature map to every C3 feature map? The reason is twofold. First, a non-complete connection scheme keeps the number of connections within reasonable bounds. More importantly, it forces a break of symmetry in the network. Different feature maps are forced to extract different (hopefully complementary) features because they get different sets of inputs. The rationale behind the connection scheme in Table I is the following. The first six C3 feature maps take inputs from every contiguous subset of three feature maps in S2. The next six take input from every contiguous subset of four. The next three take input from some discontinuous subsets of four. Finally, the last one takes input from all S2 feature maps. Layer C3 has 1,516 trainable parameters and 151,600 connections.
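The parameter and connection counts quoted for C3 can be checked from the connection scheme. The sketch below is only an illustration under stated assumptions: a 5x5 kernel per S2-to-C3 link, one bias per C3 map, 10x10 units per C3 map, cyclic wrap-around for the contiguous subsets, and the three discontinuous subsets as read from the published Table I. None of these values is spelled out in this excerpt.

```python
# Sanity check of the C3 counts quoted in the text.
# Assumptions (not stated in this excerpt): 5x5 kernels, one bias per map,
# 10x10 units per C3 map, cyclic contiguous subsets, and the three
# discontinuous subsets listed below.
N_S2_MAPS = 6
KERNEL_WEIGHTS = 5 * 5
UNITS_PER_C3_MAP = 10 * 10


def contiguous(start, length):
    """Indices of `length` consecutive S2 maps, wrapping around cyclically."""
    return [(start + k) % N_S2_MAPS for k in range(length)]


# One entry per C3 feature map: the S2 maps it takes input from.
c3_inputs = (
    [contiguous(s, 3) for s in range(6)]            # first six: subsets of three
    + [contiguous(s, 4) for s in range(6)]          # next six: subsets of four
    + [[0, 1, 3, 4], [1, 2, 4, 5], [0, 2, 3, 5]]    # discontinuous subsets of four
    + [list(range(N_S2_MAPS))]                      # last one: all S2 maps
)

# Each C3 map has one kernel per incoming S2 map plus one bias.
c3_params = sum(len(maps) * KERNEL_WEIGHTS + 1 for maps in c3_inputs)
# Every unit of a map reuses that map's parameters once (weight sharing).
c3_connections = c3_params * UNITS_PER_C3_MAP

print(c3_params, c3_connections)
```

Under these assumptions the totals agree with the 1,516 parameters and 151,600 connections stated above, which suggests the assumed kernel and map sizes are the ones the authors used.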

Layer S4 is a sub-sampling layer with 16 feature maps of size 5x5. Each unit in each feature map is connected to a 2x2 neighborhood in the corresponding feature map in C3, in a similar way as C1 and S2. Layer S4 has 32 trainable parameters and 2,000 connections.

Layer C5 is a convolutional layer with 120 feature maps. Each unit is connected to a 5x5 neighborhood on all 16 of S4's feature maps. Here, because the size of S4 is also 5x5, the size of C5's feature maps is 1x1: this amounts to a full connection between S4 and C5. C5 is labeled as a convolutional layer, instead of a fully-connected layer, because if the LeNet-5 input were made bigger with everything else kept constant, the feature map dimension would be larger than 1x1. This process of dynamically increasing the size of a convolutional network is described in Section VII. Layer C5 has 48,120 trainable connections.

Layer F6 contains 84 units (the reason for this number comes from the design of the output layer, explained below) and is fully connected to C5. It has 10,164 trainable parameters.
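The figures given for S4, C5 and F6 follow from simple counting. The arithmetic below is a sketch that assumes each sub-sampling map has one trainable coefficient and one trainable bias, and that every unit's bias counts as one connection. These conventions are inferred from the totals, not stated in this excerpt.

```python
# Layer-size arithmetic for S4, C5 and F6.
# Assumption: a sub-sampling map has 2 trainable parameters (coefficient, bias),
# and each unit's bias is counted as a connection.

# S4: 16 maps of 5x5 units, each unit looking at a 2x2 neighborhood in C3.
s4_maps, s4_side = 16, 5
s4_params = s4_maps * 2
s4_connections = s4_maps * s4_side * s4_side * (2 * 2 + 1)

# C5: 120 units, each seeing a 5x5 window on all 16 S4 maps, plus a bias.
c5_units = 120
c5_params = c5_units * (16 * 5 * 5 + 1)

# F6: 84 units fully connected to the 120 C5 units, plus a bias each.
f6_units = 84
f6_params = f6_units * (c5_units + 1)

print(s4_params, s4_connections, c5_params, f6_params)
```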

As in classical neural networks, units in layers up to F6 compute a dot product between their input vector and their weight vector, to which a bias is added. This weighted sum, denoted $a_i$ for unit $i$, is then passed through a sigmoid squashing function to produce the state of unit $i$, denoted by $x_i$:

$$x_i = f(a_i) \qquad (5)$$

The squashing function is a scaled hyperbolic tangent:

$$f(a) = A \tanh(S a) \qquad (6)$$

where $A$ is the amplitude of the function and $S$ determines its slope at the origin. The function $f$ is odd, with horizontal asymptotes at $+A$ and $-A$. The constant $A$ is chosen to be 1.7159. The rationale for this choice of a squashing function is given in Appendix A.
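As an illustration of Equations 5 and 6, here is a minimal sketch of the squashing function and of a single unit's state. The amplitude 1.7159 comes from the text. The slope parameter S = 2/3 is an assumption, since its value is not given in this excerpt, and the function names are hypothetical.

```python
import math

A = 1.7159      # amplitude, as stated in the text
S = 2.0 / 3.0   # slope parameter: assumed value, not given in this excerpt


def squash(a):
    """Scaled hyperbolic tangent of Eq. 6: f(a) = A * tanh(S * a)."""
    return A * math.tanh(S * a)


def unit_state(inputs, weights, bias):
    """Eq. 5: weighted sum of the inputs plus a bias, then squashed."""
    a = sum(w * x for w, x in zip(weights, inputs)) + bias
    return squash(a)


# With the assumed S, an activation of +1 maps very nearly to +1,
# and large activations approach the asymptote A.
print(squash(1.0), squash(-1.0), squash(50.0))
```

With this assumed S, the function very nearly satisfies f(1) = 1 and f(-1) = -1, which fits the later remark that targets of +1 and -1 lie well within the range of the sigmoid.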

Finally, the output layer is composed of Euclidean Radial Basis Function units (RBF), one for each class, with 84 inputs each. The output of each RBF unit $y_i$ is computed as follows:

$$y_i = \sum_j (x_j - w_{ij})^2 \qquad (7)$$

In other words, each output RBF unit computes the Euclidean distance between its input vector and its parameter vector. The further away the input is from the parameter vector, the larger the RBF output. The output of a particular RBF can be interpreted as a penalty term measuring the fit between the input pattern and a model of the class associated with the RBF. In probabilistic terms, the RBF output can be interpreted as the unnormalized negative log-likelihood of a Gaussian distribution in the space of configurations of layer F6. Given an input pattern, the loss function should be designed so as to get the configuration of F6 as close as possible to the parameter vector of the RBF that corresponds to the pattern's desired class. The parameter vectors of these units were chosen by hand and kept fixed (at least initially). The components of those parameter vectors were set to -1 or +1. While they could have been chosen at random with equal probabilities for -1 and +1, or even chosen to form an error correcting code as suggested by [47], they were instead designed to represent a stylized image of the corresponding character class drawn on a 7x12 bitmap (hence the number 84). Such a representation is not particularly useful for recognizing isolated digits, but it is quite useful for recognizing strings of characters taken from the full printable ASCII set. The rationale is that characters that are similar, and therefore confusable, such as uppercase O, lowercase o, and zero, or lowercase l, digit 1, square brackets, and uppercase I, will have similar output codes. This is particularly useful if the system is combined with a linguistic post-processor that can correct such confusions. Because the codes for confusable classes are similar, the output of the corresponding RBFs for an ambiguous character will be similar, and the post-processor will be able to pick the appropriate interpretation. Figure 3 gives the output codes for the full ASCII set.
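The RBF output layer can be sketched in a few lines. The +1/-1 codes below are hypothetical stand-ins generated by a simple rule. They are not the stylized 7x12 character bitmaps used by the authors, which are not reproduced in this excerpt.

```python
# Sketch of the RBF output layer of Eq. 7.
# The codes are hypothetical placeholders, NOT the authors' 7x12 bitmaps.
N_INPUTS = 7 * 12   # 84 components, one per bitmap pixel
N_CLASSES = 10


def make_placeholder_code(i):
    """A deterministic +1/-1 pattern whose period depends on the class index."""
    return [1.0 if (j // (i + 1)) % 2 == 0 else -1.0 for j in range(N_INPUTS)]


codes = [make_placeholder_code(i) for i in range(N_CLASSES)]


def rbf_outputs(x, codes):
    """y_i = sum_j (x_j - w_ij)^2: one penalty per class, smaller means closer."""
    return [sum((xj - wj) ** 2 for xj, wj in zip(x, w)) for w in codes]


# If F6 reproduces the code of class 3 exactly, that class gets zero penalty.
x = list(codes[3])
y = rbf_outputs(x, codes)
best = min(range(N_CLASSES), key=y.__getitem__)
print(best, y[best])
```

Because the outputs are penalties, the recognized class is the one with the smallest output, not the largest.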

Another reason for using such distributed codes, rather than the more common "1 of N" code (also called place code, or grandmother cell code) for the outputs is that non-distributed codes tend to behave badly when the number of classes is larger than a few dozen. The reason is that output units in a non-distributed code must be off most of the time. This is quite difficult to achieve with sigmoid units. Yet another reason is that the classifiers are often used to not only recognize characters, but also to reject non-characters. RBFs with distributed codes are more appropriate for that purpose because unlike sigmoids, they are activated within a well-circumscribed region of their input space that non-typical patterns are more likely to fall outside of.

Fig. 3. Initial parameters of the output RBFs for recognizing the full ASCII set.

The parameter vectors of the RBFs play the role of target vectors for layer F6. It is worth pointing out that the components of those vectors are +1 or -1, which is well within the range of the sigmoid of F6, and therefore prevents those sigmoids from getting saturated. In fact, +1 and -1 are the points of maximum curvature of the sigmoids. This forces the F6 units to operate in their maximally non-linear range. Saturation of the sigmoids must be avoided because it is known to lead to slow convergence and ill-conditioning of the loss function.
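The claim about maximum curvature can be checked numerically. The sketch below uses the magnitude of the second derivative as a proxy for curvature and assumes the squashing function f(a) = A tanh(S a) with A = 1.7159 and S = 2/3. The value of S is an assumption, as it is not given in this excerpt.

```python
import math

A = 1.7159      # amplitude, as stated in the text
S = 2.0 / 3.0   # slope parameter: assumed value, not given in this excerpt


def second_derivative(a):
    """Analytic f''(a) for f(a) = A * tanh(S * a)."""
    t = math.tanh(S * a)
    return -2.0 * A * S * S * t * (1.0 - t * t)


# Scan the positive axis for the point where |f''| is largest;
# the negative side mirrors it because f is odd.
grid = [k / 1000.0 for k in range(0, 3001)]
a_star = max(grid, key=lambda a: abs(second_derivative(a)))
print(a_star)
```

Under these assumptions the maximum lies very close to a = 1, though not exactly on it, which fits the statement that targets of +1 and -1 keep the F6 units in their most non-linear range.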

C. Loss Function

The simplest output loss function that can be used with the above network is the Maximum Likelihood Estimation criterion (MLE), which in our case is equivalent to the Minimum Mean Squared Error (MSE). The criterion for a set of training samples is simply:

$$E(W) = \frac{1}{P} \sum_{p=1}^{P} y_{D^p}(Z^p, W) \qquad (8)$$

where $y_{D^p}$ is the output of the $D^p$-th RBF unit, i.e. the one that corresponds to the correct class of input pattern $Z^p$. While this cost function is appropriate for most cases, it lacks three important properties. First, if we allow the parameters of the RBF to adapt, $E(W)$ has a trivial, but totally unacceptable, solution. In this solution, all the RBF parameter vectors are equal, and the state of F6 is constant and equal to that parameter vector. In this case the network happily ignores the input, and all the RBF outputs are equal to zero. This collapsing phenomenon does not occur if the RBF weights are not allowed to adapt. The second problem is that there is no competition between the classes. Such a competition can be obtained by using a more discriminative training criterion, dubbed the MAP (maximum a posteriori) criterion, similar to the Maximum Mutual Information criterion sometimes used to train HMMs [48], [49], [50]. It corresponds to maximizing the posterior probability of the correct class $D^p$ (or minimizing the negative logarithm of the probability of the correct class), given that the input image can come from one of the classes or from a background "rubbish" class label. In terms of penalties, it means that in addition to pushing down the penalty of the correct class like the MSE criterion, this criterion also pulls up the penalties of the incorrect classes:

$$E(W) = \frac{1}{P} \sum_{p=1}^{P} \left( y_{D^p}(Z^p, W) + \log\left( e^{-j} + \sum_i e^{-y_i(Z^p, W)} \right) \right) \qquad (9)$$

The negative of the second term plays a "competitive" role. It is necessarily smaller than (or equal to) the first term, therefore this loss function is positive. The constant $j$ is positive, and prevents the penalties of classes that are already very large from being pushed further up. The posterior probability of this rubbish class label would be the ratio of $e^{-j}$ and $e^{-j} + \sum_i e^{-y_i(Z^p, W)}$. This discriminative criterion prevents the previously mentioned "collapsing effect" when the RBF parameters are learned because it keeps the RBF centers apart from each other. In Section VI, we present a generalization of this criterion for systems that learn to classify multiple objects in the input (e.g., characters in words or in documents).

Computing the gradient of the loss function with respect to all the weights in all the layers of the convolutional network is done with back-propagation. The standard algorithm must be slightly modified to take account of the weight sharing. An easy way to implement it is to first compute the partial derivatives of the loss function with respect to each connection, as if the network were a conventional multi-layer network without weight sharing. Then the partial derivatives of all the connections that share a same parameter are added to form the derivative with respect to that parameter.
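The two-step recipe above can be illustrated on a one-dimensional convolution. This is a minimal sketch with hypothetical function names and a toy quadratic loss, not the authors' implementation.

```python
def conv1d(x, w):
    """Valid 1-D convolution (correlation form): all output units share kernel w."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) for i in range(len(x) - k + 1)]


def shared_weight_grad(x, w, dloss_dout):
    """Gradient of the loss with respect to each shared kernel weight."""
    k = len(w)
    n_out = len(x) - k + 1
    # Step 1: derivative for every connection (output i, tap j), computed
    # as if each connection had its own independent weight.
    per_connection = [[dloss_dout[i] * x[i + j] for j in range(k)]
                      for i in range(n_out)]
    # Step 2: add up the derivatives of all connections that share weight j.
    return [sum(per_connection[i][j] for i in range(n_out)) for j in range(k)]


# Toy example with loss = 0.5 * sum(out^2), so dloss/dout is simply out.
x = [0.5, -1.0, 2.0, 0.0, 1.5]
w = [0.2, -0.4, 0.1]
out = conv1d(x, w)
grad = shared_weight_grad(x, w, out)
print(grad)
```

The result can be compared against a finite-difference estimate of the loss to confirm that summing per-connection derivatives gives the gradient with respect to the shared parameter.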

Such a large architecture can be trained very efficiently, but doing so requires the use of a few techniques that are described in the appendix. Section A of the appendix describes details such as the particular sigmoid used, and the weight initialization. Sections B and C describe the minimization procedure used, which is a stochastic version of a diagonal approximation to the Levenberg-Marquardt procedure.

III. Results and Comparison with Other Methods

While recognizing individual digits is only one of many problems involved in designing a practical recognition system, it is an excellent benchmark for comparing shape recognition methods. Though many existing methods combine a hand-crafted feature extractor and a trainable classifier, this study concentrates on adaptive methods that operate directly on size-normalized images.

A. Database: the Modified NIST set

The database used to train and test the systems described in this paper was constructed from NIST's Special Database 3 and Special Database 1, which contain binary images of handwritten digits. NIST originally designated SD-3 as their training set and SD-1 as their test set. However, SD-3 is much cleaner and easier to recognize than SD-1. The reason for this can be found in the fact that SD-3 was collected among Census Bureau employees, while SD-1 was collected among high-school students. Drawing sensible conclusions from learning experiments requires that the result be independent of the choice of training set and test set among the complete set of samples. Therefore it was necessary to build a new database by mixing NIST's datasets.

SD-1 contains 58,527 digit images written by 500 different writers. In contrast to SD-3, where blocks of data from each writer appeared in sequence, the data in SD-1 is scrambled. Writer identities for SD-1 are available and we used this information to unscramble the writers. We then split SD-1 in two: characters written by the first 250 writers went into our new training set. The remaining 250 writers were placed in our test set. Thus we had two sets with nearly 30,000 examples each. The new training set was completed with enough examples from SD-3, starting at pattern # 0, to make a full set of 60,000 training patterns. Similarly, the new test set was completed with SD-3 examples starting at pattern # 35,000 to make a full set with 60,000 test patterns. In the experiments described here, we only used a subset of 10,000 test images (5,000 from SD-1 and 5,000 from SD-3), but we used the full 60,000 training samples. The resulting database was called the Modified NIST, or MNIST, dataset.
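The construction of the training and test sets can be sketched as follows. The record layout (a writer id paired with an image), the ordering of writers by id, and the function name are assumptions made for illustration. The toy call uses tiny numbers in place of the real 250 writers, 60,000 patterns and offset 35,000.

```python
def build_split(sd1, sd3, n_train_writers=250, target=60000, sd3_test_start=35000):
    """Writer-disjoint split of SD-1, topped up with SD-3 examples.

    sd1: list of (writer_id, image) pairs in scrambled order (assumed layout).
    sd3: list of images in their original order.
    """
    # "Unscramble" SD-1 by grouping on writer id; ordering by id is an assumption.
    writers = sorted({writer for writer, _ in sd1})
    train_writers = set(writers[:n_train_writers])
    train = [img for writer, img in sd1 if writer in train_writers]
    test = [img for writer, img in sd1 if writer not in train_writers]
    # Complete each set to the target size with SD-3 patterns.
    train += sd3[: target - len(train)]
    test += sd3[sd3_test_start: sd3_test_start + target - len(test)]
    return train, test


# Toy data: 4 writers with 3 samples each, and 20 SD-3 samples.
toy_sd1 = [(w, "a%d_%d" % (w, k)) for k in range(3) for w in (3, 0, 2, 1)]
toy_sd3 = ["b%d" % i for i in range(20)]
train, test = build_split(toy_sd1, toy_sd3, n_train_writers=2, target=10,
                          sd3_test_start=8)
print(len(train), len(test))
```

Keeping the writers of the two sets disjoint is what makes the test error a measure of generalization to unseen handwriting.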

The original black and white (bilevel) images were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing (image interpolation) technique used by the normalization algorithm. Three versions of the database were used. In the first version, the images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field. In some instances, this 28x28 field was extended to 32x32 with background pixels. This version of the database will be referred to as the regular database. In the second version of the database, the character images were deslanted and cropped down to 20x20 pixel images. The deslanting computes the second moments of inertia of the pixels (counting a foreground pixel as 1 and a background pixel as 0), and shears the image by horizontally shifting the lines so that the principal axis is vertical. This version of the database will be referred to as the deslanted database. In the third version of the database, used in some early experiments, the images were reduced to 16x16 pixels. The regular database (60,000 training examples, 10,000 test examples size-normalized to 20x20, and centered by center of mass in 28x28 fields) is available at

http://www.research.att.com/~yann/ocr/mnist

Figure 4 shows examples randomly picked from the test set.
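The centering step of the regular database can be sketched in plain Python. The function name, the list-of-lists image format, and the rounding of the shift to whole pixels are assumptions. The excerpt does not say how fractional shifts were handled.

```python
def center_by_mass(img, size=28):
    """Place `img` in a size x size field so its center of mass is at the center.

    img is a list of rows of grey levels, with 0 as background. The shift is
    rounded to whole pixels, which is an assumption of this sketch.
    """
    total = sum(sum(row) for row in img)
    if total == 0:
        raise ValueError("empty image has no center of mass")
    cy = sum(r * v for r, row in enumerate(img) for v in row) / total
    cx = sum(c * v for row in img for c, v in enumerate(row)) / total
    # Integer translation that moves (cy, cx) onto the center of the field.
    dy = round((size - 1) / 2 - cy)
    dx = round((size - 1) / 2 - cx)
    out = [[0.0] * size for _ in range(size)]
    for r, row in enumerate(img):
        for c, v in enumerate(row):
            rr, cc = r + dy, c + dx
            if 0 <= rr < size and 0 <= cc < size:
                out[rr][cc] = v
    return out


# A 20x20 image with a 2x2 blob in the top-left corner.
img = [[0.0] * 20 for _ in range(20)]
for r in (0, 1):
    for c in (0, 1):
        img[r][c] = 1.0
centered = center_by_mass(img)
print(centered[13][13], centered[14][14])
```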

B. Results

Several versions of LeNet-5 were trained on the regular MNIST database. 20 iterations through the entire training data were performed for each session. The value of the global learning rate $\eta$ (see Equation 21 in Appendix C for a definition) was decreased using the following schedule: 0.0005 for the first two passes, 0.0002 for the next three, 0.0001 for the next three, 0.00005 for the next 4, and 0.00001 thereafter. Before each iteration, the diagonal Hessian approximation was reevaluated on 500 samples, as described in Appendix C, and kept fixed during the entire iteration. The parameter $\mu$ was set to 0.02. The resulting effective learning rates during the first pass varied between approximately $7 \times 10^{-5}$ and 0.016 over the set of parameters. The test error rate stabilizes after around 10 passes through the training set at 0.95%. The error rate on the training set reaches 0.35% after 19 passes. Many authors have reported observing the common phenomenon of over-training when training neural networks or other adaptive algorithms on various tasks. When over-training occurs, the training error keeps decreasing over time, but the test error goes through a minimum and starts increasing after a certain number of iterations. While this phenomenon is very common, it was not observed in our case as the learning curves in Figure 5 show. A possible reason is that the learning rate was kept relatively large. The effect of this is that the weights never settle down in the local minimum but keep oscillating randomly. Because of those fluctuations, the average cost will be lower in a broader minimum. Therefore, stochastic gradient will have a similar effect as a regularization term that favors broader minima. Broader minima correspond to solutions with large entropy of the parameter distribution, which is beneficial to the generalization error.

Fig. 4. Size-normalized examples from the MNIST database.
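The learning-rate schedule described above is a simple step function of the pass number. The sketch below encodes it with a hypothetical helper, using 1-based pass indices.

```python
def global_learning_rate(pass_index):
    """Global learning rate for a given pass through the training set (1-based)."""
    if pass_index < 1:
        raise ValueError("pass_index starts at 1")
    # (number of passes, rate) pairs, in the order given in the text.
    schedule = [(2, 0.0005), (3, 0.0002), (3, 0.0001), (4, 0.00005)]
    last_pass = 0
    for n_passes, rate in schedule:
        last_pass += n_passes
        if pass_index <= last_pass:
            return rate
    return 0.00001  # "thereafter"


# Rates for the 20 passes used in each training session.
rates = [global_learning_rate(p) for p in range(1, 21)]
print(rates)
```

Note that this is only the global rate. The effective per-parameter rates quoted in the text also depend on the diagonal Hessian approximation and on the parameter set to 0.02, which are defined in an appendix outside this excerpt.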

The influence of the training set size was measured by training the network with 15,000, 30,000, and 60,000 examples. The resulting training error and test error are shown in Figure 6. It is clear that, even with specialized architectures such as LeNet-5, more training data would improve the accuracy.

To verify this hypothesis, we artificially generated more training examples by randomly distorting the original training images. The increased training set was composed of the 60,000 original patterns plus 540,000 instances of
