4 Z. Tian, W. Huang, T. He, P. He and Y. Qiao
in [8] on the ICDAR 2013, and 0.61 F-measure over 0.54 in [35] on the ICDAR
2015). Furthermore, it is computationally efficient, with a running time of
0.14 s/image (on the ICDAR 2013) using the very deep VGG16 model [27].
2 Related Work
Text detection. Past work in scene text detection has been dominated by
bottom-up approaches, which are generally built on stroke or character detection.
They can be roughly grouped into two categories: connected-components (CCs)
based approaches and sliding-window based methods. The CCs based approaches
discriminate text from non-text pixels using a fast filter, and then greedily
group the text pixels into stroke or character candidates using low-level
properties, e.g., intensity, color, and gradient [33,14,32,13,3]. The sliding-window
based methods detect character candidates by densely moving a multi-scale
window across an image. Each window is classified as character or non-character
by a pre-trained classifier, using manually designed features [28,29] or, more
recently, CNN features [16]. However, both groups of methods commonly suffer from
poor character detection performance, which causes errors to accumulate in the
subsequent component filtering and text line construction steps. Furthermore,
robustly filtering out non-character components and confidently verifying
detected text lines are themselves difficult problems [1,33,14]. Another
limitation is that the sliding-window methods are computationally expensive,
since they run a classifier on a huge number of sliding windows.
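To make the cost argument concrete, the following is a minimal sketch of dense
multi-scale window enumeration. It is not taken from any of the cited methods:
the function names, the square window shape, and the stride and window sizes
are all assumptions chosen for illustration.

```python
def sliding_windows(img_w, img_h, win_sizes, stride):
    """Yield (x, y, size) for every square window that fits in the image.

    Hypothetical helper: real detectors may use non-square windows or an
    image pyramid instead of varying the window size.
    """
    for size in win_sizes:
        for y in range(0, img_h - size + 1, stride):
            for x in range(0, img_w - size + 1, stride):
                yield x, y, size


def count_windows(img_w, img_h, win_sizes, stride):
    """Count the windows a classifier would have to score."""
    return sum(1 for _ in sliding_windows(img_w, img_h, win_sizes, stride))


if __name__ == "__main__":
    # Illustrative settings only; each window means one classifier call.
    n = count_windows(640, 480, win_sizes=(16, 32, 64), stride=4)
    print("windows to classify:", n)
```

The count grows with the image area and the number of scales, and shrinks with
the square of the stride, which is why dense scanning with a strong classifier
becomes expensive.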
Object detection. Convolutional Neural Networks (CNNs) have recently
advanced general object detection substantially [25,5,6]. A common strategy
is to generate a number of object proposals using inexpensive low-level
features, and then apply a strong CNN classifier to further classify and refine
the generated proposals. Selective Search (SS) [4], which generates class-agnostic
object proposals, is one of the most popular methods applied in recent leading
object detection systems, such as Region CNN (R-CNN) [6] and its extensions [5].
Recently, Ren et al. [25] proposed the Faster R-CNN system for object detection.
It introduces a Region Proposal Network (RPN) that generates high-quality
class-agnostic object proposals directly from the convolutional feature maps. The
RPN is fast because it shares convolutional computation. However, the RPN proposals
are not discriminative, and require further refinement and classification by an
additional costly CNN model, e.g., the Fast R-CNN model [5]. More importantly,
text differs significantly from general objects, making it difficult to directly
apply a general object detection system to this highly domain-specific task.
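As a rough illustration of how an RPN-style detector enumerates candidate boxes
over a convolutional feature map, the sketch below places a fixed set of
reference boxes (anchors) at every feature-map location. The function name is
hypothetical, and the stride, scales, and aspect ratios are assumed defaults
in the spirit of Faster R-CNN [25], not values stated in this section.

```python
import math


def make_anchors(feat_w, feat_h, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) tuples in image coordinates.

    Assumptions: each feature-map cell maps to a stride x stride image
    patch, and ratio means height / width at a fixed area of scale**2.
    """
    anchors = []
    for fy in range(feat_h):
        for fx in range(feat_w):
            # Center of the image patch behind this feature-map cell.
            cx = fx * stride + stride / 2.0
            cy = fy * stride + stride / 2.0
            for s in scales:
                for r in ratios:
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


if __name__ == "__main__":
    boxes = make_anchors(feat_w=4, feat_h=3)
    print(len(boxes), "anchors;", "square 128-pixel anchor:", boxes[1])
```

These anchors cover a broad range of object shapes, which suits general objects
but says nothing specific about text.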
3 Connectionist Text Proposal Network
This section presents details of the Connectionist Text Proposal Network (CTPN).
It includes three key contributions that make it reliable and accurate for text
localization: detecting text in fine-scale proposals, recurrent connectionist text
proposals, and side-refinement.