Local Discriminant Training and Global Optimization for Convolutional Neural
Network based Handwritten Chinese Character Recognition
Xiangsheng Zeng, Donglai Xiang, Liangrui Peng, Changsong Liu, Xiaoqing Ding
Tsinghua National Laboratory for Information Science and Technology
Department of Electronic Engineering, Tsinghua University, Beijing, China, 100084
Email: {cengxs13, xdl13}@tsinghua.org.cn; {penglr, lcs, dingxq}@tsinghua.edu.cn
Abstract—This paper investigates local discriminant training
and global optimization methods for Convolutional Neural
Networks (CNNs) to improve their discriminant ability and
recognition accuracy. For local discriminant training, we propose to
combine triplet loss and softmax with cross-entropy loss as the
loss function. The triplet loss is incorporated into an additional
fully-connected layer before the final fully-connected layer of
a CNN model. For global optimization, we use a Conditional
Random Field (CRF) to further exploit the pairwise distances between
the CNN feature vectors trained with triplet loss. Experiments
with different CNN models on handwritten Chinese character
samples show that the combined local discriminant training
and global optimization scheme achieves better character
recognition accuracy and confidence analysis performance.
Keywords-convolutional neural network; handwritten Chi-
nese character recognition; triplet loss; conditional random
field
I. INTRODUCTION
Convolutional Neural Networks (CNNs) have provided an
end-to-end solution for character recognition, image
classification and other machine learning tasks. However, Nguyen
et al. [1] demonstrate that CNNs are easily fooled in
that they classify many unrecognizable images with near-
certainty as members of a recognizable class. In a practical
Optical Character Recognition (OCR) system, the input to
a CNN model for segmentation-based character recognition
usually includes outliers such as over-segmented characters
and touching characters. Reliable confidence analysis
provides better rejection of these outliers. This motivates us
to improve the recognition accuracy and confidence analysis
performance of CNN by using discriminant training or
optimization.
For discriminant training of CNNs, it is straightforward to
incorporate discriminative features [2], [3]. Fukuda et al. [2]
propose to use projection matrices composed of eigenvectors
estimated with the Linear Discriminant Analysis (LDA) objective
function as initial weights for the first convolutional layer of
a CNN. Chen et al. [3] present a novel and effective method
to learn a rotation-invariant and Fisher discriminative CNN
(RIFD-CNN) model. This is achieved by introducing and
learning a rotation-invariant layer and a Fisher discrimina-
tive layer on the basis of the existing high-capacity CNN
architectures. The Fisher discriminative layer is trained by
imposing the Fisher discrimination criterion on the CNN
features so that they have small within-class variation and
large between-class variation.
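To make the Fisher criterion concrete, the following is a minimal NumPy sketch that measures within-class and between-class scatter for a set of feature vectors. It is an illustration only and not the layer formulation of Chen et al. [3]; the function name `fisher_scatter` and the toy data are assumptions introduced here.

```python
import numpy as np

def fisher_scatter(features, labels):
    """Return (within-class, between-class) scatter as scalar traces.

    Hypothetical helper for illustration; not taken from the paper.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    global_mean = features.mean(axis=0)
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        class_feats = features[labels == c]
        class_mean = class_feats.mean(axis=0)
        # Spread of the samples around their own class mean.
        within += np.sum((class_feats - class_mean) ** 2)
        # Spread of the class mean around the global mean, weighted by class size.
        between += len(class_feats) * np.sum((class_mean - global_mean) ** 2)
    return float(within), float(between)

# Toy data (assumed): two classes of two 2-D feature vectors each.
feats = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
labs = np.array([0, 0, 1, 1])
s_w, s_b = fisher_scatter(feats, labs)
fisher_ratio = s_b / s_w  # larger means better class separation
```

A discriminative layer in this spirit would push features towards a small within-class value and a large between-class value, that is, a large ratio.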
As a local discriminant training strategy, triplet loss was first
proposed for face verification to enforce a margin between
each pair of faces from one person and all other faces [4].
The formulations of triplet units, loss functions and sample
mining methods have received a lot of attention. Song et
al. [5] propose to learn a semantic CNN feature embedding
in which similar samples are mapped close to each other and
samples from different classes are mapped far apart. Zhang et
al. [6] optimize the max-margin loss on triplet samples to
learn deep hashing functions for image retrieval. Wang et al.
[7] propose an efficient triplet sampling algorithm to learn
fine-grained image similarity. Wang et al. [8] propose a hard
negative mining method for triplet sampling. Simo-Serra et
al. [9] present an aggressive mining strategy biased towards
patches that are hard to classify. Shi et al. [10] propose a
novel moderate positive sample mining method to deal with
the problem of large within-class variation.
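The margin-based triplet loss and the idea of hard negative mining discussed above can be sketched as follows. This is a minimal NumPy illustration that assumes squared Euclidean distance between embeddings and an arbitrary margin of 0.2; the function names and the toy vectors are hypothetical and do not reproduce the sampling schemes of [7]-[10].

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: the negative should be farther than the positive by `margin`."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0)

def hardest_negative(anchor, candidates):
    """Index of the candidate negative closest to the anchor (assumed mining rule)."""
    dists = np.sum((candidates - anchor) ** 2, axis=1)
    return int(np.argmin(dists))

# Toy embeddings (assumed): one anchor, one positive, two candidate negatives.
a = np.array([0.0, 0.0])
p = np.array([1.0, 0.0])
negs = np.array([[3.0, 0.0], [0.0, 1.0]])

idx = hardest_negative(a, negs)            # picks the nearer negative
loss_easy = triplet_loss(a, p, negs[0])    # far negative already satisfies the margin
loss_hard = triplet_loss(a, p, negs[idx])  # near negative violates the margin
```

The far negative contributes zero loss and therefore no gradient, which is why mining strategies concentrate on triplets that still violate the margin.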
For global optimization of CNNs, Jaderberg et al. [11]
incorporate a Conditional Random Field (CRF) to find the
character sequence that maximizes the CRF score, enforcing
the consistency of the individual predictions. Isola et al. [12]
likewise use a CRF to solve the image collection parsing
problem, treating it as a global optimization problem in the
same way as has been proposed for pixel parsing.
Inspired by the above works, we propose local discriminant
training and global optimization strategies for
CNNs. Local discriminant training is realized by
incorporating triplet loss into the loss function during
training, and global optimization is achieved by deploying
a CRF. We conduct experiments on the ICDAR 2013 offline
handwritten Chinese character recognition competition
dataset. Although the state-of-the-art performance on the
ICDAR 2013 competition dataset was achieved by
Zhang et al. [13], who combine traditional feature
extraction with a writer-adapted deep
convolutional neural network, our focus is to compare the
recognition accuracy and confidence analysis performance
of the local discriminant training and global optimization
2017 14th IAPR International Conference on Document Analysis and Recognition
2379-2140/17 $31.00 © 2017 IEEE
DOI 10.1109/ICDAR.2017.70