An End-to-End Trainable Neural Network for Image-based Sequence
Recognition and Its Application to Scene Text Recognition
Baoguang Shi, Xiang Bai and Cong Yao
School of Electronic Information and Communications
Huazhong University of Science and Technology, Wuhan, China
{shibaoguang,xbai}@hust.edu.cn, yaocong2010@gmail.com
Abstract
Image-based sequence recognition has been a long-
standing research topic in computer vision. In this pa-
per, we investigate the problem of scene text recognition,
which is among the most important and challenging tasks
in image-based sequence recognition. A novel neural net-
work architecture, which integrates feature extraction, se-
quence modeling and transcription into a unified frame-
work, is proposed. Compared with previous systems for
scene text recognition, the proposed architecture possesses
four distinctive properties: (1) It is end-to-end trainable,
in contrast to most of the existing algorithms whose compo-
nents are separately trained and tuned. (2) It naturally han-
dles sequences of arbitrary lengths, involving no character
segmentation or horizontal scale normalization. (3) It is not
confined to any predefined lexicon and achieves remarkable
performances in both lexicon-free and lexicon-based scene
text recognition tasks. (4) It generates an effective yet much
smaller model, which is more practical for real-world ap-
plication scenarios. The experiments on standard bench-
marks, including the IIIT-5K, Street View Text and ICDAR
datasets, demonstrate the superiority of the proposed algo-
rithm over prior art. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which clearly demonstrates its generality.
1. Introduction
Recently, the community has seen a strong revival of
neural networks, which is mainly stimulated by the great
success of deep neural network models, specifically Deep
Convolutional Neural Networks (DCNN), in various vision
tasks. However, the majority of recent works related to deep neural networks have been devoted to the detection or classification of object categories [12, 25]. In this paper, we are con-
cerned with a classic problem in computer vision: image-
based sequence recognition. In the real world, a variety of visual objects, such as scene text, handwriting and musical scores, tend to occur in the form of sequences, rather than in isolation. Unlike general object recognition, recognizing such
sequence-like objects often requires the system to predict
a series of object labels, instead of a single label. There-
fore, recognition of such objects can be naturally cast as a
sequence recognition problem. Another unique property of
sequence-like objects is that their lengths may vary drasti-
cally. For instance, English words can consist of either 2 characters, such as “OK”, or 15 characters, such as “congratulations”. Consequently, the most popular deep models like
DCNN [25, 26] cannot be directly applied to sequence pre-
diction, since DCNN models often operate on inputs and
outputs with fixed dimensions, and thus are incapable of
producing a variable-length label sequence.
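To make the fixed-dimension limitation concrete, the following minimal PyTorch-style sketch (illustrative only; the layer sizes and class set are assumptions, not the architecture proposed in this paper) shows that a conventional CNN classifier ends in a fully connected layer whose output size is fixed at construction time, so it emits exactly one label per image and can never produce a label sequence of varying length.

```python
# Minimal sketch (not the proposed architecture): a plain CNN classifier
# whose head is a fixed-size fully connected layer. All shapes are
# illustrative assumptions.
import torch
import torch.nn as nn

class FixedOutputCNN(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),   # collapses to a fixed spatial size
        )
        # The classifier maps to a *fixed* number of outputs decided up front.
        self.classifier = nn.Linear(64 * 4 * 4, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = FixedOutputCNN(num_classes=26)   # e.g. one of 26 character classes
word_image = torch.randn(1, 1, 32, 100)  # a word image of arbitrary width
logits = model(word_image)
print(logits.shape)  # torch.Size([1, 26]): always a single label,
                     # never a sequence such as the 15 labels of "congratulations"
```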
Some attempts have been made to address this problem
for a specific sequence-like object (e.g. scene text). For
example, the algorithms in [35, 8] first detect individual
characters and then recognize these detected characters with
DCNN models, which are trained using labeled character
images. Such methods often require training a strong char-
acter detector for accurately detecting and cropping each
character out from the original word image. Some other
approaches (such as [22]) treat scene text recognition as
an image classification problem, and assign a class label
to each English word (90K words in total). This results in a large trained model with a huge number of classes, which is difficult to generalize to other types of sequence-like objects, such as Chinese text and musical scores, because the number of basic combinations of such sequences can exceed one million. In summary, current systems based on DCNN cannot be directly used for image-based sequence recognition.
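As a rough back-of-envelope illustration (the feature dimension below is an assumption for illustration, not a figure reported in [22]), even the output layer alone of such a per-word classifier is very large, and the class count required for richer label sets grows far beyond what a single softmax can accommodate:

```python
# Back-of-envelope sketch with assumed numbers (not taken from [22]):
# one class per English word makes the final classification layer huge.
feature_dim = 4096          # assumed dimensionality of the last feature layer
num_word_classes = 90_000   # one class per word in a 90K-word English lexicon

final_layer_params = feature_dim * num_word_classes + num_word_classes  # weights + biases
print(f"~{final_layer_params / 1e6:.0f}M parameters in the output layer alone")
# ~369M parameters; for Chinese text or musical scores the number of distinct
# label sequences (i.e. classes) can exceed one million, so the
# one-class-per-sequence strategy does not scale.
```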
Recurrent neural network (RNN) models, another important branch of the deep neural network family, were
mainly designed for handling sequences. One of the ad-
vantages of RNN is that it does not require the position of each element in a sequence object image in either training or testing. However, a preprocessing step that converts