Scene Text Recognition Based on Deep Learning: A Brief Survey
Yuxin Chen
College of Computer Science
Inner Mongolia University
Hohhot, China
e-mail: cyx3292@163.com
Yunxue Shao
College of Computer Science
Inner Mongolia University
Hohhot, China
e-mail: csshyx@imu.edu.cn
Abstract—Scene text recognition is a universal text recognition
technology, which has become a research hotspot in computer
field in recent years. Compared with the traditional document
text recognition, the scene text recognition is more complex in
aspect of font, distribution, background and so on. Which
makes the traditional OCR technology no longer adapt to the
new challenge. With the development of technology, deep
learning has achieved good results in the field of image
recognition. Therefore, this paper mainly summarizes the
representative achievements in scene text recognition field
based on deep learning method.
Keywords-deep learning; scene text recognition;
convolutional neural networks; recurrent neural network
I. INTRODUCTION
Text is one of the main ways of information transmission
and interaction among human beings and plays an
indispensable role in our life. In the natural scene image,
there are a lot of text information, extracting text information
from images of natural scenes can help us to understand
images better. Therefore, text recognition in natural scenes
has important theoretical research value and practical
application value.
In recent years, deep learning technology develops
rapidly and plays a leading role in the field of OCR. The
OCR technology based on deep learning has achieved
significant improvements in both the accuracy and efficiency
of text recognition. In view of this, this paper summarizes the
representative achievements in the field of scene text
recognition based on deep learning method, hoping to help
readers who are interested in deep learning and scene text
recognition.
II. B
ACKGROUND KNOWLEDGE
A. Deep Learning Theory
Deep learning is a very popular research direction in the
field of machine learning in recent years. It is a deep network
structure based on multiple hidden layers. A more abstract
high-level representation attribute categories or features are
formed by combining low-level features to discover a
distributed feature representation of the data. Deep learning
transforms the original data into higher-level and more
abstract feature expressions through a large number of non-
linear transformations. With enough combination of non-
linear transformations, deep neural networks can learn very
complex functions. Generally speaking, deep learning
requires to train a large number of data in order to make the
neural networks have good generalization ability.
B. Scene Text Recognition Process
A typical natural scene text processing mainly consists of
two parts: text detection and text recognition. The main
function of text detection is to find the text area from the
image and separate the text area from the original image. The
main function of text recognition is to recognize text on the
separated image. This paper mainly introduces text
recognition, which is usually divided into the following steps:
Preprocessing: The text area obtained by the detection
step usually are affected by some factor, for example noice.
Therefore, it is necessary to preprocess the image before text
recognition. Preprocessing usually includes the following
steps: denoising, image enhancement, and scaling.
Feature extraction: It is often difficult to achieve an ideal
results by directly recognizing words at the pixel level, so it
is necessary to define a set of features to represent the image.
Some commonly used features include edge features, stroke
features, and structural features and so on.
Recognition: Text recognition task can be considered as
classification task, each character represents a category. The
recognizer makes extracted features as input and outputs
corresponding characters or words. Common recognizers
include random forest, support vector machine, neural
network and so on.
Figure 1. Scene text recognition process
C. Research on Scene Text Recognition
In recent years, some articles about scene text recognition
methods have been published in various academic journals.
Referring to relevant references, it can be found that
recognition methods can be roughly divided into two
categories according to the algorithms used for classification:
One is based on traditional methods; The other is based on
deep learning. The two methods are described as follow:
688
2019 IEEE 11th International Conference on Communication Software and Networks
978-1-7281-2184-0/19/$31.00 ©2019 IEEE