Scene Text Detection via Edge Cue and Multi-Features
Youbao Tang, Xiangqian Wu
The School of Computer Science and Technology
Harbin Institute of Technology
Harbin, China
tangyoubao@hit.edu.cn, xqwu@hit.edu.cn
Abstract—Inspired by the fact that edges are an important cue to
distinguish texts from background, we propose a novel scene
text detection method via edge cue and multiple features, which
has two main parts, i.e. candidate character region (CCR)
extraction and region classification. For CCR extraction, the
edges are first extracted from the input image, which are then
broken and merged based on color features to form the final
edge image. For each edge connected component, a number of
image patches are extracted by translating and scaling its
bounding rectangle to generate the CCRs. For region
classification, the character regions are extracted from the
CCRs by using a region classification technique, which extracts
both hand-designed low-level features and deep convolutional
neural network based high-level features of the regions for
classification. The character regions are then merged to
form the candidate text regions, based on which the final text
regions are detected by using the region classification technique.
The proposed method is evaluated on the two latest ICDAR
benchmark datasets, and the experimental results demonstrate
that the proposed method outperforms the state-of-the-art
scene text detection approaches.
Keywords-scene text detection; candidate region extraction;
region classification; edge cue; multiple features
I. INTRODUCTION
Scene text detection aims to locate the position of texts in
different scenes, e.g. guideposts, store marks, and warning
signs, as shown in Fig. 1, which is one of the most important
steps for end-to-end scene text recognition. Effective scene
text detection can enhance the performance of numerous
multimedia applications, e.g. mobile visual search, content-
based image retrieval, and automatic sign translation.
Because of the unconstrained scene environments, e.g.
different text sizes, colors, and complex backgrounds, scene
text detection remains a challenging problem in the computer
vision community. Over the past few years, a large number of scene text
detection approaches [1-20] have been proposed, most of
which have been summarized by Ye and Doermann [21]. A
series of international scene text detection competitions has also
been successfully organized [22-24]. Generally, the existing
approaches can be roughly divided into two groups: sliding
window based approaches and connected component based
approaches. Here, we briefly summarize the previous scene
text detection approaches and then discuss the work most
related to ours in detail.
The sliding window based approaches [10-13] first slide a
large number of windows with different scales through all
possible positions of the image and then extract features to
classify the windowed regions as text or background. One
advantage of these approaches is that they keep almost all of
the true text regions. However, they generate numerous
candidate regions that must be classified in subsequent stages,
resulting in high computational cost.
The key factor determining detection performance is the
discriminability of the extracted features. Early approaches
extract hand-designed low-level features, e.g. HOG and SIFT
[11, 12]. To improve classification performance, some
researchers [10, 13] have recently used the convolutional neural
network (CNN) to learn deep high-level features and achieved
state-of-the-art results. The connected component based
approaches [1-9, 14, 15, 18, 20] first cluster the pixels into
larger connected components according to the pixels’
properties, e.g. intensity, color, and stroke width, and then
extract features from connected components for classification.
One advantage of these approaches is that they greatly reduce
the number of candidate regions, although some true character
regions may be lost. Almost all approaches of both kinds use
only hand-designed low-level features or CNN-based high-
level features for region classification.
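As a rough illustration of the sliding-window scheme summarized above, the following Python sketch enumerates candidate windows at several scales and passes each to a placeholder classifier. The scales, stride fraction, and classifier stub are illustrative assumptions, not the settings of any cited approach.

```python
def sliding_windows(img_h, img_w, scales=(32, 64, 128), stride_frac=0.5):
    """Enumerate candidate windows (x, y, w, h) at several scales,
    as in sliding-window text detectors. The scale set and stride
    fraction here are arbitrary illustrative choices."""
    boxes = []
    for s in scales:
        stride = max(1, int(s * stride_frac))
        for y in range(0, img_h - s + 1, stride):
            for x in range(0, img_w - s + 1, stride):
                boxes.append((x, y, s, s))
    return boxes

def classify_stub(box):
    # Placeholder for the feature-extraction + classification step
    # (HOG/SIFT or CNN features); a real system would score the
    # image patch inside `box` here.
    return 0.0

boxes = sliding_windows(128, 256)
scores = [classify_stub(b) for b in boxes]
```

Even on this small 128x256 image, the enumeration yields over a hundred candidate windows, which illustrates why these approaches incur high computational cost in the classification stage.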
Inspired by the fact that edges are one of the most important
cues to distinguish texts from background, this paper proposes
a novel scene text detection method via edge cue and multiple
features, which consists of two main stages, i.e. candidate
character region extraction and region classification. For the
stage of candidate character region extraction, the proposed
Figure 1. Detection results of the proposed method (indicated by blue
rectangles) on different scene images, which nearly match the ground truths
(indicated by red and green rectangles). (a) Guidepost. (b) Store mark. (c)
Warning sign.
2016 15th International Conference on Frontiers in Handwriting Recognition
2167-6445/16 $31.00 © 2016 IEEE
DOI 10.1109/ICFHR.2016.37