     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
  0  X           X  X  X        X  X  X  X     X  X
  1  X  X           X  X  X        X  X  X  X     X
  2  X  X  X           X  X  X        X     X  X  X
  3     X  X  X        X  X  X  X        X     X  X
  4        X  X  X        X  X  X  X     X  X     X
  5           X  X  X        X  X  X  X     X  X  X
TABLE I
Each column indicates which feature maps in S2 are combined
by the units in a particular feature map of C3.
combined by each C3 feature map. Why not connect every S2 feature map to every C3 feature map? The reason is twofold. First, a non-complete connection scheme keeps the number of connections within reasonable bounds. More importantly, it forces a break of symmetry in the network. Different feature maps are forced to extract different (hopefully complementary) features because they get different sets of inputs. The rationale behind the connection scheme in Table I is the following. The first six C3 feature maps take inputs from every contiguous subset of three feature maps in S2. The next six take input from every contiguous subset of four. The next three take input from some discontinuous subsets of four. Finally, the last one takes input from all S2 feature maps. Layer C3 has 1,516 trainable parameters and 151,600 connections.
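The connection scheme and the quoted counts can be checked with a short sketch. The Python fragment below is not part of the original paper: it rebuilds Table I from the rule just stated and recomputes the totals, where the 5x5 kernels and the 10x10 size of the C3 feature maps are inferred from the figures 1,516 and 151,600, and the three discontinuous subsets are read off Table I.

```python
# Illustrative sketch: rebuild the S2 -> C3 connection scheme of Table I
# and verify the stated counts (1,516 parameters, 151,600 connections).

def c3_connection_table():
    """Return, for each of the 16 C3 maps, the tuple of S2 maps it reads."""
    table = []
    # First six C3 maps: every contiguous subset of three S2 maps (with wraparound).
    for i in range(6):
        table.append(tuple((i + k) % 6 for k in range(3)))
    # Next six: every contiguous subset of four S2 maps.
    for i in range(6):
        table.append(tuple((i + k) % 6 for k in range(4)))
    # Next three: discontinuous subsets of four, as listed in Table I.
    table += [(0, 1, 3, 4), (1, 2, 4, 5), (0, 2, 3, 5)]
    # Last one: all six S2 maps.
    table.append(tuple(range(6)))
    return table

KERNEL_AREA = 5 * 5    # assumed 5x5 receptive field in each input map
C3_MAP_AREA = 10 * 10  # assumed 10x10 output feature maps

params = sum(KERNEL_AREA * len(inputs) + 1 for inputs in c3_connection_table())
connections = params * C3_MAP_AREA
print(params, connections)  # 1516 151600
```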
Layer S4 is a sub-sampling layer with 16 feature maps of size 5x5. Each unit in each feature map is connected to a 2x2 neighborhood in the corresponding feature map in C3, in a similar way as C1 and S2. Layer S4 has 32 trainable parameters and 2,000 connections.
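These figures can be verified directly: each sub-sampling map has a single trainable coefficient and a single trainable bias (as in S2), and each of its $5 \times 5 = 25$ units draws on a $2 \times 2$ neighborhood plus the bias, so
$$16 \times 2 = 32 \ \text{parameters}, \qquad 16 \times 25 \times (2 \times 2 + 1) = 2{,}000 \ \text{connections}.$$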
Layer C5 is a convolutional layer with 120 feature maps. Each unit is connected to a 5x5 neighborhood on all 16 of S4's feature maps. Here, because the size of S4 is also 5x5, the size of C5's feature maps is 1x1: this amounts to a full connection between S4 and C5. C5 is labeled as a convolutional layer, instead of a fully-connected layer, because if the LeNet-5 input were made bigger with everything else kept constant, the feature map dimension would be larger than 1x1. This process of dynamically increasing the size of a convolutional network is described in Section VII. Layer C5 has 48,120 trainable connections.
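The count follows from full connectivity: each of the 120 units sees all $16 \times 5 \times 5 = 400$ S4 outputs plus one bias, giving $120 \times 401 = 48{,}120$ trainable connections (equal here to the number of trainable parameters, since no weights are shared across a 1x1 map).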
Layer F6 contains 84 units (the reason for this number comes from the design of the output layer, explained below) and is fully connected to C5. It has 10,164 trainable parameters.
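With each of the 84 units fully connected to the 120 units of C5 and carrying its own bias, the count is $84 \times (120 + 1) = 10{,}164$ trainable parameters.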
As in classical neural networks, units in layers up to F6 compute a dot product between their input vector and their weight vector, to which a bias is added. This weighted sum, denoted $a_i$ for unit $i$, is then passed through a sigmoid squashing function to produce the state of unit $i$, denoted by $x_i$:
$$x_i = f(a_i) \quad (5)$$
The squashing function is a scaled hyperbolic tangent:
$$f(a) = A \tanh(Sa) \quad (6)$$
where $A$ is the amplitude of the function and $S$ determines its slope at the origin. The function $f$ is odd, with horizontal asymptotes at $+A$ and $-A$. The constant $A$ is chosen to be 1.7159. The rationale for this choice of a squashing function is given in Appendix A.
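A minimal sketch of this unit computation is given below. Only the amplitude $A = 1.7159$ is fixed in the text; the slope value $S$ used here is an assumption (Appendix A discusses its choice), and the random inputs are placeholders.

```python
import numpy as np

A = 1.7159   # amplitude given in the text
S = 2.0 / 3  # slope at the origin; assumed value, see Appendix A

def squash(a):
    """Scaled hyperbolic tangent f(a) = A tanh(S a), Eq. (6)."""
    return A * np.tanh(S * a)

def f6_unit(x, w, b):
    """State of one F6 unit: weighted sum a_i = w . x + b, then x_i = f(a_i), Eq. (5)."""
    return squash(np.dot(w, x) + b)

# Example: one F6 unit driven by a 120-dimensional C5 output (placeholder values).
rng = np.random.default_rng(0)
c5_out = rng.standard_normal(120)
w, b = rng.standard_normal(120), 0.0
print(f6_unit(c5_out, w, b))  # lies strictly between -1.7159 and +1.7159
```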
Finally, the output layer is composed of Euclidean Radial Basis Function units (RBF), one for each class, with 84 inputs each. The output of each RBF unit $y_i$ is computed as follows:
$$y_i = \sum_j (x_j - w_{ij})^2. \quad (7)$$
In other words, each output RBF unit computes the Euclidean distance between its input vector and its parameter vector. The further the input is from the parameter vector, the larger the RBF output. The output of a particular RBF can be interpreted as a penalty term measuring the fit between the input pattern and a model of the class associated with the RBF. In probabilistic terms, the RBF output can be interpreted as the unnormalized negative log-likelihood of a Gaussian distribution in the space of configurations of layer F6. Given an input pattern, the loss function should be designed so as to get the configuration of F6 as close as possible to the parameter vector of the RBF that corresponds to the pattern's desired class. The parameter vectors of these units were chosen by hand and kept fixed (at least initially). The components of those parameter vectors were set to -1 or +1. While they could have been chosen at random with equal probabilities for -1 and +1, or even chosen to form an error correcting code as suggested by [47], they were instead designed to represent a stylized image of the corresponding character class drawn on a 7x12 bitmap (hence the number 84). Such a representation is not particularly useful for recognizing isolated digits, but it is quite useful for recognizing strings of characters taken from the full printable ASCII set. The rationale is that characters that are similar, and therefore confusable, such as uppercase O, lowercase o, and zero, or lowercase l, digit 1, square brackets, and uppercase I, will have similar output codes. This is particularly useful if the system is combined with a linguistic post-processor that can correct such confusions. Because the codes for confusable classes are similar, the output of the corresponding RBFs for an ambiguous character will be similar, and the post-processor will be able to pick the appropriate interpretation. Figure 3 gives the output codes for the full ASCII set.
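As a sketch of this output stage, the fragment below implements Eq. (7) over a matrix of ±1 parameter vectors. The stylized 7x12 bitmaps are not reproduced here, so random ±1 placeholders stand in for the hand-designed codes, and the ten classes are only an example.

```python
import numpy as np

F6_SIZE = 7 * 12   # 84 RBF inputs, one per pixel of a stylized 7x12 bitmap
NUM_CLASSES = 10   # example only; the paper uses one RBF unit per character class

# Parameter vectors w_ij with components in {-1, +1}.  In the paper these are
# hand-designed stylized character images; random placeholders are used here.
rng = np.random.default_rng(0)
rbf_weights = rng.choice([-1.0, 1.0], size=(NUM_CLASSES, F6_SIZE))

def rbf_outputs(x, w):
    """Eq. (7): y_i = sum_j (x_j - w_ij)^2, the squared Euclidean distance
    between the F6 state vector x and each class's parameter vector."""
    return np.sum((x - w) ** 2, axis=1)

# The smallest penalty marks the class model that best fits the F6 configuration.
f6_state = np.tanh(rng.standard_normal(F6_SIZE))  # placeholder F6 configuration
penalties = rbf_outputs(f6_state, rbf_weights)
print(penalties.argmin(), penalties)
```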
Another reason for using such distributed codes, rather than the more common "1 of N" code (also called place code, or grand-mother cell code) for the outputs, is that non-distributed codes tend to behave badly when the number of classes is larger than a few dozen. The reason is that output units in a non-distributed code must be off most of the time. This is quite difficult to achieve with sigmoid units. Yet another reason is that the classifiers are often used not only to recognize characters, but also to reject non-characters. RBFs with distributed codes are more appropriate for that purpose because, unlike sigmoids, they are activated within a well circumscribed region of their in-