By contrast, the goal of integrated methodologies is to
identify specific words in imagery with character and language
models. Integrated methodologies can avoid the challenging
segmentation step, or optimize it jointly with character and
word recognition, which makes them less sensitive to complex
backgrounds and low-resolution text. The disadvantage lies
in that the multi-class character classification procedure is
computationally expensive when considering a large char-
acter class number and a large amount of candidate win-
dows. In addition, the increase of word class number could
significantly decrease the detection and recognition perfor-
mance, so the generality is often limited to a small lexicon of
words.
4 FUNDAMENTAL SUB-PROBLEMS
In this section, sub-problems including text localization,
verification, segmentation, and recognition are described.
Each approach is reviewed with respect to its primary con-
tribution. The approaches that make multiple contributions
are analyzed with respect to each contribution.
4.1 Text Localization
The objective of text localization is to localize text compo-
nents precisely as well as to group them into candidate text
regions with as little background as possible. For text locali-
zation, connected component analysis (CCA) and sliding
window classification are two widely used methods, and
color, edges, strokes, and texture are typically used as
features.
4.1.1 Methods
Connected component analysis. CCA could be regarded as a
graph algorithm, where subsets of connected components
are uniquely labeled based on heuristics about feature con-
sensus, i.e., color similarity and spatial layout. In implemen-
tations of CCA, syntactic pattern recognition methods are
often used to analyze the spatial and feature consensus, and
to define text regions. Considering the complexity of fine-tuning
the syntactic rules, a new trend is to perform CCA
with statistical models [109], [138], [182], e.g., using an
AdaBoost classifier on pairwise spatial features to learn the
CCA models [182]. The use of statistical models in CCA
significantly improves its adaptivity.
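The rule-based form of CCA described above can be sketched as follows. This is a minimal illustration and not the procedure of any cited work: the function names, the 4-connectivity choice, and the thresholds max_gap and max_height_ratio are assumptions made for the example. Foreground pixels are first labeled by flood fill, and the resulting components are then chained into candidate text regions when a hand-written layout rule (similar height, vertical overlap, small horizontal gap) is satisfied. A statistical variant would replace the compatible() rule with a learned pairwise classifier.

```python
from collections import deque

def label_components(binary):
    """Label 4-connected foreground pixels; return (label image, component count)."""
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or labels[y][x]:
                continue
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:  # breadth-first flood fill of one component
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not labels[ny][nx]:
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count

def bounding_boxes(labels):
    """Return {label: (x_min, y_min, x_max, y_max)} for every component."""
    boxes = {}
    for y, row in enumerate(labels):
        for x, lab in enumerate(row):
            if lab:
                x0, y0, x1, y1 = boxes.get(lab, (x, y, x, y))
                boxes[lab] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes

def compatible(a, b, max_gap, max_height_ratio):
    """Hypothetical layout rule: similar height, vertical overlap, small gap."""
    ha, hb = a[3] - a[1] + 1, b[3] - b[1] + 1
    similar_height = max(ha, hb) / min(ha, hb) <= max_height_ratio
    vertical_overlap = min(a[3], b[3]) >= max(a[1], b[1])
    gap = b[0] - a[2] - 1  # empty columns between the two boxes
    return similar_height and vertical_overlap and gap <= max_gap

def group_components(boxes, max_gap=2, max_height_ratio=2.0):
    """Greedily chain components, left to right, into candidate text regions."""
    groups = []
    for lab in sorted(boxes, key=lambda k: boxes[k][0]):
        for group in groups:
            if compatible(boxes[group[-1]], boxes[lab], max_gap, max_height_ratio):
                group.append(lab)
                break
        else:
            groups.append([lab])
    return groups

# Toy binary image: two nearby strokes and one distant, shorter stroke.
image = [
    [1, 0, 1, 0, 0, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 0, 0],
]
labels, count = label_components(image)
groups = group_components(bounding_boxes(labels))
print(count, groups)
```

In this toy image the two nearby strokes are grouped together, while the distant stroke forms its own candidate because the horizontal gap exceeds max_gap.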
Sliding window classification. In the sliding window classi-
fication method, multi-scale image windows that are classi-
fied into positives are further grouped into text regions with
morphological operations [130], CRF [148] or graph meth-
ods [123], [173]. The advantage of this method lies in the
simple and adaptive training-detection architecture. Never-
theless, it is often computationally expensive when complex
classification methods are used and a large number of win-
dows need to be classified.
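The training-detection architecture and its cost can be illustrated with a short sketch. The window size, the scales, the stride, and the density_classifier stand-in are assumptions for this example; in practice the classifier would be a trained model, and the positive windows would subsequently be grouped into text regions.

```python
def sliding_windows(width, height, base=8, scales=(1, 2), stride_frac=0.5):
    """Yield (x, y, size) for square windows at several scales."""
    for scale in scales:
        size = base * scale
        stride = max(1, int(size * stride_frac))
        for y in range(0, height - size + 1, stride):
            for x in range(0, width - size + 1, stride):
                yield x, y, size

def density_classifier(image, x, y, size, threshold=0.5):
    """Placeholder for a trained text/non-text classifier (assumption):
    a window is 'text' if enough of its pixels are foreground."""
    total = sum(image[yy][xx] for yy in range(y, y + size) for xx in range(x, x + size))
    return total / (size * size) >= threshold

def detect(image, classify):
    """Classify every window and keep the positives."""
    h, w = len(image), len(image[0])
    return [win for win in sliding_windows(w, h) if classify(image, *win)]

# Toy 32x16 image containing one 8x8 foreground block.
image = [[1 if 8 <= x < 16 and 4 <= y < 12 else 0 for x in range(32)] for y in range(16)]
hits = detect(image, density_classifier)
print(len(list(sliding_windows(32, 16))), "windows,", len(hits), "positives")
```

The number of windows grows with the image area, the number of scales, and the inverse of the stride, and each window requires one classifier evaluation. This is the source of the computational expense noted above.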
4.1.2 Features
For text localization, color [174], edge [28] and texture fea-
tures [19] were conventionally used, and stroke [47], [107],
[163], point [152], region [137], [138], [150], [164], [182] and
character appearance features [94], [196], [198], [199] have
recently been explored.
Color features. Text is often produced in a consistent and
distinguishable color so that it contrasts with the back-
ground [40]. Under this assumption, color features could be
used to localize text [2], [22], [54], [63], [82], [92], [96], [109],
[150]. Color-based text localization, in use for some 20 years,
is often simple and efficient, although it is sensitive
to multi-color characters and uneven lighting, which can
seriously degrade color features.
An early color-based text localization approach is from
Jain and Yu [2]. They used color reduction to generate color
layers and a clustering algorithm to obtain CCs, and then
connected the CCs into text candidates using color similarity
and component layout analysis. In other work [95], it was shown that
the use of a mean-shift algorithm to generate color layers
could improve the robustness to complex backgrounds.
To be adaptive to color variation, color features are
extracted in converted or combined color spaces or
described with mixture models [27], [74], [76], [174]. In [7],
Garcia and Apostolidis performed text extraction with a k-
means clustering algorithm in the hue-saturation-value
(HSV) color space. Karatzas and Antonacopoulos [33]
extracted text components with a split-and-merge strategy
in the hue-lightness-saturation (HLS) color space. Chen
et al. [26] proposed using Gaussian mixture models in R, G,
B, hue and intensity channels to localize text.
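As an illustration of clustering in a converted color space, the sketch below runs k-means on pixels mapped to HSV. It is not the procedure of [7]. The mapping of hue and saturation to Cartesian coordinates (so that hues near 0 and near 1 are treated as close), the farthest-point initialization, and all function names are assumptions made for the example.

```python
import colorsys
import math

def hsv_features(rgb):
    """Map an 8-bit RGB pixel to (s*cos(hue), s*sin(hue), v).
    Hue is an angle, so it is embedded on a circle before clustering."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    angle = 2.0 * math.pi * h
    return (s * math.cos(angle), s * math.sin(angle), v)

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans(points, k, iterations=10):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center for every point.
        assignment = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centers

# Toy pixels: three reddish (e.g., text) and two bluish (e.g., background).
pixels = [(250, 10, 10), (240, 20, 15), (255, 0, 0), (10, 10, 250), (20, 15, 240)]
assignment, centers = kmeans([hsv_features(p) for p in pixels], k=2)
print(assignment)
```

Each cluster index defines one color layer. Connected components can then be extracted per layer, as in the color-reduction approaches described above.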
Edge/Gradient features. The family of edge/gradient-based
approaches assumes that text exhibits a strong and symmet-
ric gradient against its background. Thus, those pixels with
large and symmetric gradient values could be regarded as
text components. In [4], [12], [23], [27], [80], [114], [177],
[181] edge features are used to detect text components, and
in [12], [24], [71], [98] gradient features are used.
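The "strong and symmetric gradient" assumption can be made concrete on a single image row. The sketch below is only an illustration of the assumption and not the method of any cited work. It computes a central-difference gradient and reports pairs of nearby positions whose gradients have opposite sign and similar magnitude, as would be expected on the two sides of a character stroke. The thresholds min_strength, max_width and tolerance are hypothetical.

```python
def horizontal_gradient(row):
    """Central-difference gradient of one image row (zero at the borders)."""
    n = len(row)
    return [0.0 if x in (0, n - 1) else (row[x + 1] - row[x - 1]) / 2.0 for x in range(n)]

def symmetric_edge_pairs(row, min_strength=40, max_width=6, tolerance=0.3):
    """Return (i, j) index pairs with strong, opposite, similar-magnitude gradients."""
    g = horizontal_gradient(row)
    pairs = []
    for i, gi in enumerate(g):
        if abs(gi) < min_strength:
            continue
        # Search to the right, within the maximum stroke width, for a partner edge.
        for j in range(i + 1, min(len(g), i + max_width + 1)):
            gj = g[j]
            opposite = gi * gj < 0
            strong = abs(gj) >= min_strength
            similar = abs(abs(gi) - abs(gj)) <= tolerance * max(abs(gi), abs(gj))
            if opposite and strong and similar:
                pairs.append((i, j))
                break
    return pairs

stroke = [10, 10, 10, 200, 200, 200, 10, 10, 10, 10]          # bright stroke on dark background
step = [10, 10, 10, 200, 200, 200, 200, 200, 200, 200]        # one-sided edge, no stroke
print(symmetric_edge_pairs(stroke), symmetric_edge_pairs(step))
```

The stroke row yields paired edges, whereas the one-sided step edge yields none, although both contain equally strong gradients.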
Wu et al. [4] proposed using Gaussian derivatives to
extract horizontally aligned vertical edges, which are aggre-
gated to produce chips corresponding to text strings if
“short paths” exist between edge pairs. In recent work
[167], Phan et al. proposed grouping horizontally aligned
components of “gradient vector flow” into text candidates
based on spatial constraints of sizes, positions and color
distances.
Compared with color features, gradient/edge features
are less sensitive to uneven lighting and multi-color charac-
ters [9]. They are combined with such classifiers as artificial
neural networks [14], [16] or AdaBoost [28], [68] to perform
sliding window based text localization. However, they often
have difficulty discriminating text components from
complex backgrounds that also exhibit strong gradients.
Texture features. When characters are dense, text could be
considered as a texture [29]. Texture features including
Fourier Transform [116], Discrete Cosine Transform (DCT)
[8], Wavelet [5], [49], LBP, and HOG [113] have been used
to localize text. Such features are usually combined with a
multi-scale sliding window classification method to perform
text localization. Texture features are effective for
detecting dense characters, although they might not detect
sparse characters, e.g., signs in scene images, which lack
significant texture properties.
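As an example of a texture descriptor, the sketch below computes the basic 8-neighbor LBP code and a normalized histogram over a window. The neighbor ordering and the ">=" comparison are one common convention and are assumptions here. The histogram would typically serve as the feature vector passed to a sliding window classifier.

```python
def lbp_code(image, y, x):
    """8-bit local binary pattern of the pixel at (y, x).
    A bit is set when the neighbor is at least as bright as the center."""
    center = image[y][x]
    neighbors = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dy, dx) in enumerate(neighbors):
        if image[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(image):
    """Normalized 256-bin histogram of LBP codes over the interior pixels."""
    hist = [0] * 256
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            hist[lbp_code(image, y, x)] += 1
    total = sum(hist)
    return [count / total for count in hist] if total else hist

flat = [[5] * 5 for _ in range(5)]                              # no texture
stripes = [[9 if x % 2 else 0 for x in range(5)] for _ in range(5)]  # dense vertical strokes
flat_bins = sum(1 for v in lbp_histogram(flat) if v > 0)
stripe_bins = sum(1 for v in lbp_histogram(stripes) if v > 0)
print(flat_bins, stripe_bins)
```

A textureless window concentrates all mass in a single bin, while a window with dense strokes spreads it over several bins. A window containing only a few sparse characters stays close to the textureless case, which is consistent with the limitation noted above.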
Li et al. pioneered the text localization method with
Wavelet texture features [5]. They proposed using mean,
second and third order central moments of wavelet coeffi-
cients and a neural network to classify image windows, of
1484 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 37, NO. 7, JULY 2015