Zobeir Raisi et al.
Recently, several works [47, 48, 115, 116] have treated
scene text detection as an instance segmentation problem;
an example is shown in Fig. 4(b). Many of them apply the
Mask R-CNN [112] framework to improve the performance
of scene text detection, since it performs instance segmentation
of text regions together with semantic segmentation of
word-text instances, and thereby makes it possible to detect
text instances of arbitrary shape. For example, inspired
by Mask R-CNN, SPCNET [116] uses a text context
module to detect text of arbitrary shape and a re-scoring
mechanism to suppress false positives.
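One step such segmentation-based detectors share is turning a predicted binary text mask into one box per text instance. The stdlib-only Python sketch below illustrates that step under simplifying assumptions (4-connectivity, axis-aligned boxes, a toy hand-written mask); the function name and the mask are illustrative, not taken from any cited method:

```python
from collections import deque

def masks_to_boxes(mask):
    """Group foreground pixels (1s) of a binary mask into connected
    components (4-connectivity) and return one (x0, y0, x1, y1) box
    per component, i.e. per detected text instance."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] and not seen[sy][sx]:
                # BFS over one component, tracking its spatial extent
                x0 = x1 = sx
                y0 = y1 = sy
                q = deque([(sy, sx)])
                seen[sy][sx] = True
                while q:
                    y, x = q.popleft()
                    x0, x1 = min(x0, x), max(x1, x)
                    y0, y1 = min(y0, y), max(y1, y)
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                boxes.append((x0, y0, x1, y1))
    return boxes

# Toy mask containing two separate "text" regions
mask = [[0, 1, 1, 0, 0, 0],
        [0, 1, 1, 0, 1, 1],
        [0, 0, 0, 0, 1, 1]]
print(masks_to_boxes(mask))  # → [(1, 0, 2, 1), (4, 1, 5, 2)]
```

Real systems fit rotated rectangles or polygons to each component rather than axis-aligned boxes, which is where the border-pixel errors discussed next become harmful.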
However, the methods in [48, 115, 116] have some drawbacks
that may degrade their performance. First, they
suffer from bounding-box handling errors in complicated
backgrounds, where the predicted bounding box fails
to cover the whole text instance. Second, these methods
[48, 115, 116] aim at separating text pixels from background
pixels, which can lead to many mislabeled pixels at the
text borders [47].
Hybrid methods [91, 96, 97, 117] use a segmentation-based
approach to predict text score maps and then
obtain text bounding boxes through regression. For example,
the single-shot text detector (SSTD) [91] used an attention
mechanism to enhance text regions in the image and
reduce background interference at the feature level. Liao
et al. [96] proposed rotation-sensitive regression for oriented
scene text detection, which makes full use of rotation-invariant
features by actively rotating the convolutional filters.
However, this method is incapable of capturing all the
other possible text shapes that exist in scene images [46].
Lyu et al. [97] presented a method that detects and groups
corner points of text regions to generate text boxes. Besides
detecting long oriented text and handling considerable
variation in aspect ratio, this method requires only simple
post-processing. Liu et al. [47] proposed a new Mask R-CNN-based
framework, the pyramid mask text detector (PMTD),
which assigns a soft pyramid label, l ∈ [0, 1],
to each pixel in a text instance and then reinterprets the
resulting 2D soft mask in 3D space. They also introduced
a novel plane clustering algorithm that infers the optimal
text box from the 3D shape, achieving state-of-the-art
performance on the ICDAR LSVT dataset [118]. PMTD
also achieved superior performance on multi-oriented text
datasets [62, 119]. However, because its framework is designed
explicitly for multi-oriented text, it is hard to apply
to curved-text datasets [63, 120], and only bounding-box
results obtained with polygonal bounding boxes are reported
in the paper.
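To make the soft pyramid label concrete, the sketch below computes it for a single pixel in the simplified axis-aligned case: the label is 1 at the box centre and decays linearly to 0 at the border. This is an illustration of the labelling idea only; PMTD itself handles arbitrary quadrilaterals, and the function name is our own:

```python
def pyramid_label(x, y, x0, y0, x1, y1):
    """Soft pyramid label l in [0, 1] for pixel (x, y) in the
    axis-aligned box (x0, y0, x1, y1): 1 at the box centre,
    decaying linearly to 0 at the border (0 outside the box)."""
    if not (x0 <= x <= x1 and y0 <= y <= y1):
        return 0.0
    # Distance to the nearest vertical / horizontal edge,
    # normalised so the box centre maps to 1
    lx = min(x - x0, x1 - x) / ((x1 - x0) / 2)
    ly = min(y - y0, y1 - y) / ((y1 - y0) / 2)
    return min(lx, ly)

# A box spanning [0, 4] x [0, 4], centred at (2, 2)
print(pyramid_label(2, 2, 0, 0, 4, 4))  # → 1.0 (apex at the centre)
print(pyramid_label(1, 2, 0, 0, 4, 4))  # → 0.5 (halfway to an edge)
print(pyramid_label(0, 3, 0, 0, 4, 4))  # → 0.0 (on the border)
```

Stacking these labels over the box yields the 2D soft mask whose 3D interpretation (label as height) PMTD's plane clustering operates on.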
In the following section, we describe scene text recognition
algorithms.
2.2 Text Recognition
The scene text recognition task aims to convert detected text
regions into characters or words. Case-sensitive character
classes often consist of 10 digits, 26 lowercase letters, 26
uppercase letters, 32 ASCII punctuation marks, and an
end-of-sentence (EOS) symbol. However, the text recognition
models proposed in the literature use different choices of
character classes; Table 3 lists the number used by each model.
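The 95-class case-sensitive alphabet described above can be enumerated directly from Python's `string` module constants; this sketch builds the class list and the character-to-index map a recognition model's output layer would use (the `"<EOS>"` token is a placeholder of our choosing, not a standard symbol):

```python
import string

# 10 digits + 26 lowercase + 26 uppercase + 32 ASCII punctuation
# marks + 1 end-of-sentence (EOS) symbol = 95 classes
classes = (list(string.digits)            # 10
           + list(string.ascii_lowercase)  # 26
           + list(string.ascii_uppercase)  # 26
           + list(string.punctuation)      # 32
           + ["<EOS>"])                    # 1
print(len(classes))  # → 95

# Char-to-index map, e.g. for indexing a softmax output layer
char_to_idx = {c: i for i, c in enumerate(classes)}
```

Case-insensitive variants simply drop the uppercase block (and often the punctuation), which is why the class counts in Table 3 differ across models.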
Since the properties of scene text images differ
from those of scanned documents, it is difficult to develop
an effective text recognition method based on classical
machine learning, such as [121–126]. As mentioned
in Section 1, this is because images captured in the
wild tend to include text under various challenging conditions,
such as low resolution [64, 127], extreme lighting
[64, 127], and adverse environmental conditions [62, 114], as
well as varied fonts [62, 114, 128], orientation
angles [63, 128], languages [119], and lexicons [64, 127].
Researchers have proposed different techniques to address these
challenges, which can be categorized into classical
machine learning-based [20, 28, 29, 64, 84, 123] and deep
learning-based [32, 52–55, 129–140] text recognition
methods; both are discussed in the rest of this section.
2.2.1 Classical Machine Learning-based Methods
In the past two decades, traditional scene text recognition
methods [28, 29, 123] have used standard image features,
such as HOG [73] and SIFT [141], with a classical machine
learning classifier, such as an SVM [142] or k-nearest
neighbors [143], after which a statistical language model or visual
structure prediction is applied to prune out misclassified
characters [1, 80].
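The pipeline just described can be sketched as follows. This stdlib-only Python illustration keeps only the pipeline's shape: the feature extractor, character classifier, and word scorer are toy stand-ins for the HOG features, SVM or nearest-neighbor classifiers, and statistical language models used in the cited works:

```python
def classical_recognition_pipeline(word_image, win_w,
                                   extract_features, classify_char,
                                   score_word):
    """Sketch of the classical bottom-up pipeline: slide a window
    across the word image, classify each window's features into a
    character, then let a language model rescore candidate strings."""
    w = len(word_image[0])
    chars = []
    for x in range(0, w - win_w + 1, win_w):    # non-overlapping windows
        window = [row[x:x + win_w] for row in word_image]
        feats = extract_features(window)        # e.g. HOG in [20, 64]
        chars.append(classify_char(feats))      # e.g. SVM / nearest neighbor
    candidates = ["".join(chars)]               # real systems keep top-k per window
    return max(candidates, key=score_word)      # language-model pruning

# Toy stand-ins: "features" are the window's ink count, the
# "classifier" maps counts to letters, the "language model" is trivial.
image = [[1, 1, 0, 0, 1, 0],
         [1, 1, 0, 0, 0, 0]]                    # two 3-pixel-wide windows
extract = lambda win: sum(sum(row) for row in win)
classify = {4: "a", 1: "b"}.get
print(classical_recognition_pipeline(image, 3, extract, classify, len))  # → ab
```

Real systems additionally use overlapping windows and keep several character hypotheses per position, which is exactly where the language model earns its keep.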
Most classical machine learning-based methods follow a
bottom-up approach in which classified characters are linked
into words. For example, in [20, 64], HOG features are first
extracted from each sliding window, and then a pre-trained
nearest-neighbor or SVM classifier is applied to classify the
characters of the input word image. Neumann and Matas
[84] proposed a set of handcrafted features, which include
aspect and hole area ratios, used with an SVM classifier for
text recognition. However, these methods [20, 22, 64, 84]
neither achieve effective recognition accuracy, owing
to the low representational capability of handcrafted features,
nor yield models capable of handling text recognition
in the wild. Other works adopted a top-down approach,
where the word is recognized directly from the entire input
image rather than by detecting and recognizing individual
characters [144]. For example, Almazan et al. [144] treated
word recognition as a content-based image retrieval problem,
where word images and word labels are embedded into
a common Euclidean space and the embedding vectors are used to