1998 LeNet Paper: Gradient-Based Learning Applied to Document Recognition


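This resource is the 1998 paper "Gradient-Based Learning Applied to Document Recognition", published in the Proceedings of the IEEE. The landmark article has four authors: Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Its centerpiece is LeNet-5, a convolutional network in the LeNet family (the name derives from LeCun), and the paper marks an important advance in applying neural networks to computer vision, document recognition in particular. The paper examines how to train multi-layer neural networks with back-propagation, which computes the gradient of the loss with respect to the network weights so that gradient descent can adjust them to minimize prediction error. At the time, this approach markedly improved accuracy on image tasks such as handwritten character recognition, because the network learns a hierarchy of feature representations directly from the data. LeNet is built from convolutional layers and sub-sampling (pooling) layers, now standard building blocks of deep learning, which reduce the model's sensitivity to the position of features in the input and make it more robust to shifts and distortions. The paper presents both the architecture and its performance on real document recognition tasks, which laid the groundwork for later computer vision research and commercial applications. It remains an essential reference for understanding the foundations of modern computer vision.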

PROC. OF THE IEEE, NOVEMBER 1998 8

   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0  X  .  .  .  X  X  X  .  .  X  X  X  X  .  X  X
1  X  X  .  .  .  X  X  X  .  .  X  X  X  X  .  X
2  X  X  X  .  .  .  X  X  X  .  .  X  .  X  X  X
3  .  X  X  X  .  .  X  X  X  X  .  .  X  .  X  X
4  .  .  X  X  X  .  .  X  X  X  X  .  X  X  .  X
5  .  .  .  X  X  X  .  .  X  X  X  X  .  X  X  X

TABLE I

Each column indicates which feature maps in S2 are combined by the units in a particular feature map of C3.

[...] combined by each C3 feature map. Why not connect every S2 feature map to every C3 feature map? The reason is twofold. First, a non-complete connection scheme keeps the number of connections within reasonable bounds. More importantly, it forces a break of symmetry in the network. Different feature maps are forced to extract different (hopefully complementary) features because they get different sets of inputs. The rationale behind the connection scheme in Table I is the following. The first six C3 feature maps take inputs from every contiguous subset of three feature maps in S2. The next six take input from every contiguous subset of four. The next three take input from some discontinuous subsets of four. Finally the last one takes input from all S2 feature maps. Layer C3 has 1,516 trainable parameters and 151,600 connections.
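The parameter and connection counts quoted for C3 can be checked with a short sketch. It assumes a 10x10 C3 feature map (the size is not stated in this excerpt) and takes the three discontinuous subsets as {0,1,3,4}, {1,2,4,5} and {0,2,3,5}, as in Table I of the original paper; the variable names are illustrative only.

```python
# Sketch: rebuild the S2 -> C3 connection scheme and count parameters.
S2_MAPS = 6
KERNEL = 5 * 5   # each C3 unit sees a 5x5 neighborhood per connected S2 map
C3_SIDE = 10     # assumption: C3 feature maps are 10x10 (not stated in this excerpt)

connections = {}
for i in range(6):
    # First six C3 maps: contiguous subsets of three S2 maps (wrapping around).
    connections[i] = [(i + k) % S2_MAPS for k in range(3)]
for i in range(6):
    # Next six C3 maps: contiguous subsets of four S2 maps.
    connections[6 + i] = [(i + k) % S2_MAPS for k in range(4)]
# Next three: discontinuous subsets of four (assumed from Table I of the paper).
connections[12] = [0, 1, 3, 4]
connections[13] = [1, 2, 4, 5]
connections[14] = [0, 2, 3, 5]
# Last one: all S2 maps.
connections[15] = list(range(S2_MAPS))

# One 5x5 kernel per connected S2 map, plus one bias per C3 feature map.
params = sum(len(src) * KERNEL + 1 for src in connections.values())
# Every unit of a C3 map uses each of that map's parameters exactly once.
links = params * C3_SIDE * C3_SIDE

print(params, links)  # 1516 151600
```

Under these assumptions the totals agree with the 1,516 parameters and 151,600 connections given in the text.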

Layer S4 is a sub-sampling layer with 16 feature maps of size 5x5. Each unit in each feature map is connected to a 2x2 neighborhood in the corresponding feature map in C3, in a similar way as C1 and S2. Layer S4 has 32 trainable parameters and 2,000 connections.

Layer C5 is a convolutional layer with 120 feature maps. Each unit is connected to a 5x5 neighborhood on all 16 of S4's feature maps. Here, because the size of S4 is also 5x5, the size of C5's feature maps is 1x1: this amounts to a full connection between S4 and C5. C5 is labeled as a convolutional layer, instead of a fully-connected layer, because if LeNet-5 input were made bigger with everything else kept constant, the feature map dimension would be larger than 1x1. This process of dynamically increasing the size of a convolutional network is described in Section VII. Layer C5 has 48,120 trainable connections.

Layer F6 contains 84 units (the reason for this number comes from the design of the output layer, explained below) and is fully connected to C5. It has 10,164 trainable parameters.
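The counts given for S4, C5 and F6 follow from the layer sizes in the text. The sketch below is only an arithmetic check; it assumes that each S4 feature map has one trainable coefficient and one trainable bias (by analogy with S2, which this excerpt does not describe), and the helper name conv_params is hypothetical.

```python
def conv_params(n_maps, inputs_per_map, kernel_side):
    """Weights plus one bias per feature map, with every input map connected."""
    return n_maps * (inputs_per_map * kernel_side ** 2 + 1)

# S4: assumed one trainable coefficient and one bias per feature map.
s4_params = 16 * 2
# Each of the 16 * 5 * 5 units has a 2x2 receptive field (4 inputs) plus a bias.
s4_connections = 16 * 5 * 5 * (2 * 2 + 1)

# C5: 120 maps of size 1x1, each reading a 5x5 window on all 16 S4 maps.
c5_params = conv_params(n_maps=120, inputs_per_map=16, kernel_side=5)

# F6: 84 units, each with 120 weights and one bias.
f6_params = 84 * (120 + 1)

print(s4_params, s4_connections, c5_params, f6_params)  # 32 2000 48120 10164
```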

As in classical neural networks, units in layers up to F6 compute a dot product between their input vector and their weight vector, to which a bias is added. This weighted sum, denoted $a_i$ for unit $i$, is then passed through a sigmoid squashing function to produce the state of unit $i$, denoted by $x_i$:

$$x_i = f(a_i) \tag{5}$$

The squashing function is a scaled hyperbolic tangent:

$$f(a) = A \tanh(S a) \tag{6}$$

where $A$ is the amplitude of the function and $S$ determines its slope at the origin. The function $f$ is odd, with horizontal asymptotes at $+A$ and $-A$. The constant $A$ is chosen to be 1.7159. The rationale for this choice of a squashing function is given in Appendix A.
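A minimal sketch of this squashing function follows. The amplitude 1.7159 comes from the text; the slope parameter is not given in this excerpt, so the value 2/3 used here is an assumption (it is the value usually quoted from Appendix A of the paper, and it makes the function pass close to +1 and -1 at inputs +1 and -1).

```python
import math

A = 1.7159      # amplitude, as stated in the text
S = 2.0 / 3.0   # assumption: slope parameter, not given in this excerpt

def squash(a):
    """Scaled hyperbolic tangent f(a) = A * tanh(S * a)."""
    return A * math.tanh(S * a)

for a in (-1.0, 0.0, 1.0, 10.0):
    print(a, round(squash(a), 4))
```

With these constants, squash(1.0) is very close to 1 and large inputs approach the asymptote A.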

Finally, the output layer is composed of Euclidean Radial Basis Function units (RBF), one for each class, with 84 inputs each. The output of each RBF unit $y_i$ is computed as follows:

$$y_i = \sum_j (x_j - w_{ij})^2 \tag{7}$$

In other words, each output RBF unit computes the Euclidean distance between its input vector and its parameter vector. The further away is the input from the parameter vector, the larger is the RBF output. The output of a particular RBF can be interpreted as a penalty term measuring the fit between the input pattern and a model of the class associated with the RBF. In probabilistic terms, the RBF output can be interpreted as the unnormalized negative log-likelihood of a Gaussian distribution in the space of configurations of layer F6. Given an input pattern, the loss function should be designed so as to get the configuration of F6 as close as possible to the parameter vector of the RBF that corresponds to the pattern's desired class. The parameter vectors of these units were chosen by hand and kept fixed (at least initially). The components of those parameter vectors were set to -1 or +1. While they could have been chosen at random with equal probabilities for -1 and +1, or even chosen to form an error correcting code as suggested by [47], they were instead designed to represent a stylized image of the corresponding character class drawn on a 7x12 bitmap (hence the number 84). Such a representation is not particularly useful for recognizing isolated digits, but it is quite useful for recognizing strings of characters taken from the full printable ASCII set. The rationale is that characters that are similar, and therefore confusable, such as uppercase O, lowercase O, and zero, or lowercase l, digit 1, square brackets, and uppercase I, will have similar output codes. This is particularly useful if the system is combined with a linguistic post-processor that can correct such confusions. Because the codes for confusable classes are similar, the output of the corresponding RBFs for an ambiguous character will be similar, and the post-processor will be able to pick the appropriate interpretation. Figure 3 gives the output codes for the full ASCII set.
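The RBF output layer of Eq. (7) can be sketched in a few lines. The two 84-component codes below are hypothetical placeholders, not the stylized 7x12 bitmaps of the paper; they only show that an F6 state near one code yields a small penalty for that class and a large one for the other.

```python
def rbf_outputs(x, centers):
    """Eq. (7): squared Euclidean distance from x to each class parameter vector."""
    return [sum((xj - wj) ** 2 for xj, wj in zip(x, w)) for w in centers]

# Hypothetical codes with components in {-1, +1}; real codes are 7x12 bitmaps.
code_a = [1.0] * 84
code_b = [1.0] * 42 + [-1.0] * 42

x = [0.9] * 84                       # an F6 state close to code_a
y = rbf_outputs(x, [code_a, code_b])
predicted = y.index(min(y))          # smallest penalty wins

print(predicted)  # 0
```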

Another reason for using such distributed codes, rather than the more common "1 of N" code (also called place code, or grand-mother cell code) for the outputs is that non-distributed codes tend to behave badly when the number of classes is larger than a few dozens. The reason is that output units in a non-distributed code must be off most of the time. This is quite difficult to achieve with sigmoid units. Yet another reason is that the classifiers are often used to not only recognize characters, but also to reject non-characters. RBFs with distributed codes are more appropriate for that purpose because unlike sigmoids, they are activated within a well circumscribed region of their input space that non-typical patterns are more likely to fall outside of.

Fig. 3. Initial parameters of the output RBFs for recognizing the full ASCII set.

The parameter vectors of the RBFs play the role of target vectors for layer F6. It is worth pointing out that the components of those vectors are +1 or -1, which is well within the range of the sigmoid of F6, and therefore prevents those sigmoids from getting saturated. In fact, +1 and -1 are the points of maximum curvature of the sigmoids. This forces the F6 units to operate in their maximally non-linear range. Saturation of the sigmoids must be avoided because it is known to lead to slow convergence and ill-conditioning of the loss function.

C. Loss Function

The simplest output loss function that can be used with the above network is the Maximum Likelihood Estimation criterion (MLE), which in our case is equivalent to the Minimum Mean Squared Error (MSE). The criterion for a set of training samples is simply:

$$E(W) = \frac{1}{P} \sum_{p=1}^{P} y_{D^p}(Z^p, W) \tag{8}$$

where $y_{D^p}$ is the output of the $D^p$-th RBF unit, i.e. the one that corresponds to the correct class of input pattern $Z^p$. While this cost function is appropriate for most cases, it lacks three important properties. First, if we allow the parameters of the RBF to adapt, $E(W)$ has a trivial, but totally unacceptable, solution. In this solution, all the RBF parameter vectors are equal, and the state of F6 is constant and equal to that parameter vector. In this case the network happily ignores the input, and all the RBF outputs are equal to zero. This collapsing phenomenon does not occur if the RBF weights are not allowed to adapt. The second problem is that there is no competition between the classes. Such a competition can be obtained by using a more discriminative training criterion, dubbed the MAP (maximum a posteriori) criterion, similar to Maximum Mutual Information criterion sometimes used to train HMMs [48], [49], [50]. It corresponds to maximizing the posterior probability of the correct class $D^p$ (or minimizing the logarithm of the probability of the correct class), given that the input image can come from one of the classes or from a background "rubbish" class label. In terms of penalties, it means that in addition to pushing down the penalty of the correct class like the MSE criterion, this criterion also pulls up the penalties of the incorrect classes:

$$E(W) = \frac{1}{P} \sum_{p=1}^{P} \left( y_{D^p}(Z^p, W) + \log\left( e^{-j} + \sum_i e^{-y_i(Z^p, W)} \right) \right) \tag{9}$$

The negative of the second term plays a "competitive" role. It is necessarily smaller than (or equal to) the first term, therefore this loss function is positive. The constant $j$ is positive, and prevents the penalties of classes that are already very large from being pushed further up. The posterior probability of this rubbish class label would be the ratio of $e^{-j}$ and $e^{-j} + \sum_i e^{-y_i(Z^p, W)}$. This discriminative criterion prevents the previously mentioned "collapsing effect" when the RBF parameters are learned because it keeps the RBF centers apart from each other. In Section VI, we present a generalization of this criterion for systems that learn to classify multiple objects in the input (e.g., characters in words or in documents).

Computing the gradient of the loss function with respect to all the weights in all the layers of the convolutional network is done with back-propagation. The standard algorithm must be slightly modified to take account of the weight sharing. An easy way to implement it is to first compute the partial derivatives of the loss function with respect to each connection, as if the network were a conventional multi-layer network without weight sharing. Then the partial derivatives of all the connections that share a same parameter are added to form the derivative with respect to that parameter.
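The weight-sharing rule described above can be illustrated on a toy one-dimensional convolution. This is a sketch, not the paper's implementation: the loss (half the sum of squared outputs) and the function names are chosen only for illustration.

```python
def conv1d(x, w):
    """Valid 1-D correlation: y[t] = sum_k w[k] * x[t + k]."""
    n = len(x) - len(w) + 1
    return [sum(w[k] * x[t + k] for k in range(len(w))) for t in range(n)]

def toy_loss(x, w):
    """Illustrative loss: half the sum of squared outputs."""
    return 0.5 * sum(y * y for y in conv1d(x, w))

def shared_weight_grad(x, w):
    """Sum per-connection derivatives over all connections sharing each weight."""
    y = conv1d(x, w)
    grad = [0.0] * len(w)
    for t, yt in enumerate(y):
        for k in range(len(w)):
            # Derivative for connection (t, k), treated as an independent weight.
            d_connection = yt * x[t + k]
            # Connections that share w[k] add their derivatives together.
            grad[k] += d_connection
    return grad

x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, -1.0]
grad = shared_weight_grad(x, w)
print(grad)  # [-13.0, -19.0]
```

A finite-difference check on toy_loss gives the same values, which is a quick way to validate this kind of accumulation.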

Such a large architecture can be trained very efficiently, but doing so requires the use of a few techniques that are described in the appendix. Section A of the appendix describes details such as the particular sigmoid used, and the weight initialization. Sections B and C describe the minimization procedure used, which is a stochastic version of a diagonal approximation to the Levenberg-Marquardt procedure.

III. Results and Comparison with Other Methods

While recognizing individual digits is only one of many problems involved in designing a practical recognition system, it is an excellent benchmark for comparing shape recognition methods. Though many existing methods combine a hand-crafted feature extractor and a trainable classifier, this study concentrates on adaptive methods that operate directly on size-normalized images.

A. Database: the Modied NIST set

The database used to train and test the systems described in this paper was constructed from the NIST's Special Database 3 and Special Database 1 containing binary images of handwritten digits. NIST originally designated SD-3 as their training set and SD-1 as their test set. However, SD-3 is much cleaner and easier to recognize than SD-1. The reason for this can be found on the fact that SD-3 [...]
