4 Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, Xiang Bai
2.2 Scene Text Recognition
Scene text recognition [53, 46] aims at decoding the detected or cropped image
regions into character sequences. The previous scene text recognition approaches
can be roughly split into three branches: character-based methods, word-based
methods, and sequence-based methods. The character-based recognition meth-
ods [2, 22] mostly first localize individual characters and then recognize and
group them into words. In [20], Jaderberg et al. propose a word-based method
which treats text recognition as a common English words (90k) classification
problem. Sequence-based methods solve text recognition as a sequence labeling
problem. In [44], Shi et al. use CNN and RNN to model image features and
output the recognized sequences with CTC [11]. In [26, 45], Lee et al. and Shi
et al. recognize scene text via attention based sequence-to-sequence model.
The proposed text recognition component in our framework can be classified
as a character-based method. However, in contrast to previous character-based
approaches, we use an FCN [42] to localize and classify characters simultaneously.
Besides, compared with sequence-based methods which are designed for a 1-D
sequence, our method is more suitable to handle irregular text (multi-oriented
text, curved text et al.).
2.3 Scene Text Spotting
Most of the previous text spotting methods [21, 30, 12, 29] split the spotting
process into two stages. They first use a scene text detector [21, 30, 29] to localize
text instances and then use a text recognizer [20, 44] to obtain the recognized
text. In [27, 3], Li et al. and Busta et al. propose end-to-end methods to localize
and recognize text in a unified network, but require relatively complex training
procedures. Compared with these methods, our proposed text spotter can not
only be trained end-to-end completely, but also has the ability to detect and
recognize arbitrary-shape (horizontal, oriented, and curved) scene text.
2.4 General Object Detection and Semantic Segmentation
With the rise of deep learning, general object detection and semantic segmenta-
tion have achieved great development. A large number of object detection and
segmentation methods [9, 8, 40, 6, 32, 33, 39, 42, 5, 28, 13] have been pro-
posed. Benefited from those methods, scene text detection and recognition have
achieved obvious progress in the past few years. Our method is also inspired
by those methods. Specifically, our method is adapted from a general object in-
stance segmentation model Mask R-CNN [13]. However, there are key differences
between the mask branch of our method and that in Mask R-CNN. Our mask
branch can not only segment text regions but also predict character probabil-
ity maps, which means that our method can be used to recognize the instance
sequence inside character maps rather than predicting an object mask only.