Abstract
In this paper, we propose a novel method called
Rotational Region CNN (R²CNN) for detecting
arbitrary-oriented texts in natural scene images. The
framework is based on Faster R-CNN [1] architecture.
First, we use the Region Proposal Network (RPN) to
generate axis-aligned bounding boxes that enclose the texts
with different orientations. Second, for each axis-aligned
text box proposed by the RPN, we extract pooled features
with several pooled sizes, and the concatenated features
are used to simultaneously predict the text/non-text score,
the axis-aligned box, and the inclined minimum area box.
Finally, we use inclined non-maximum suppression to obtain
the detection results. Our approach achieves competitive
results on text detection benchmarks: ICDAR 2015 and
ICDAR 2013.
1. Introduction
Texts in natural scenes (e.g., street nameplates, store
names, goods names) play an important role in our daily life,
as they carry essential information about the environment.
Once scene texts are understood, they can be used in many
areas, such as text-based retrieval and translation. There
are usually two key steps to understand scene texts: text
detection and text recognition. This paper focuses on scene
text detection. Scene text detection is challenging because
scene texts have different sizes, width-height aspect ratios,
font styles, lighting, perspective distortion, orientation, etc.
As orientation information is useful for scene text
recognition and other tasks, scene text detection differs
from common object detection in that the text orientation
should also be predicted in addition to the axis-aligned
bounding box information.
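To make the extra output concrete: one common way to represent an inclined box (used here purely as an illustration; the paper's own parameterization may differ) is a center, size, and rotation angle, which map to the four corner points via a 2D rotation:

```python
import math

def inclined_box_corners(cx, cy, w, h, theta):
    """Corner points of a rotated rectangle given by center (cx, cy),
    size (w, h), and angle theta in radians. Purely illustrative; the
    function name and parameterization are assumptions, not the paper's."""
    c, s = math.cos(theta), math.sin(theta)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each half-extent offset and translate by the center.
    return [(cx + c * dx - s * dy, cy + s * dx + c * dy) for dx, dy in half]
```

With theta = 0 this reduces to the ordinary axis-aligned box, which is why an oriented detector strictly generalizes a horizontal one.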
While most previous text detection methods are designed
for detecting horizontal or near-horizontal texts
[2,3,4,5,6,7,8,9,10,11,12,13,14], some methods try to
address the arbitrary-oriented text detection problem
[15,16,17,18,19,20,31,32,33,34]. Recently, arbitrary-oriented
scene text detection has become a hot research area, as can be
seen from the frequent result updates in the ICDAR 2015 Robust
Reading competition on incidental scene text detection [21].
While traditional text detection methods are based on sliding
windows or Connected Components (CCs) [2,3,4,6,10,13,17,18,19,20],
deep learning based methods have been widely studied recently
[7,8,9,12,15,16,31,32,33,34].

Fig. 1. The procedure of the proposed method R²CNN. (a) Original
input image; (b) text regions (axis-aligned bounding boxes)
generated by RPN; (c) predicted axis-aligned boxes and inclined
minimum area boxes (each inclined box is associated with an
axis-aligned box, and the associated box pair is indicated by the
same color); (d) detection result after inclined non-maximum
suppression.
This paper presents a Rotational Region CNN (R²CNN)
for detecting arbitrary-oriented scene texts. It is based on
Faster R-CNN architecture [1]. Figure 1 shows the
procedure of the proposed method. Figure 1(a) is the
original input image. We first use the RPN to propose
axis-aligned bounding boxes that enclose the texts (Figure
1(b)). Then we classify the proposals, refine the
axis-aligned boxes and predict the inclined minimum area
boxes with pooled features of different pooled sizes (Figure
1(c)). Finally, inclined non-maximum suppression is applied
to the detection candidates to produce the final detection
results (Figure 1(d)). Our method yields an
F-measure of 82.54% on ICDAR 2015 incidental text
detection benchmark and 87.73% on ICDAR 2013 focused
text detection benchmark.
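The final post-processing step above can be sketched in code. The following is an illustrative implementation under my own assumptions (inclined boxes given as four corner points in counter-clockwise order, rotated-box IoU computed by exact polygon clipping, and a hypothetical 0.3 overlap threshold); it is not the authors' implementation.

```python
# Sketch of inclined non-maximum suppression over rotated boxes.
# Each box is a list of 4 (x, y) corners in counter-clockwise order.

def _cross(o, a, b):
    # z-component of (a - o) x (b - o); >= 0 means b lies left of ray o->a.
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def _intersect(s, e, a, b):
    # Intersection of segment s-e with the infinite line through a-b.
    d1, d2 = _cross(a, b, s), _cross(a, b, e)
    t = d1 / (d1 - d2)
    return (s[0] + t * (e[0] - s[0]), s[1] + t * (e[1] - s[1]))

def _clip_polygon(subject, clipper):
    # Sutherland-Hodgman clipping of `subject` by the convex CCW `clipper`.
    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        inp, output = output, []
        for j in range(len(inp)):
            s, e = inp[j - 1], inp[j]
            s_in, e_in = _cross(a, b, s) >= 0, _cross(a, b, e) >= 0
            if e_in:
                if not s_in:
                    output.append(_intersect(s, e, a, b))
                output.append(e)
            elif s_in:
                output.append(_intersect(s, e, a, b))
        if not output:
            break
    return output

def _area(poly):
    # Shoelace formula for polygon area.
    return abs(sum(poly[i][0] * poly[(i + 1) % len(poly)][1]
                   - poly[(i + 1) % len(poly)][0] * poly[i][1]
                   for i in range(len(poly)))) / 2.0

def inclined_iou(p, q):
    inter = _clip_polygon(p, q)
    ia = _area(inter) if len(inter) >= 3 else 0.0
    return ia / (_area(p) + _area(q) - ia + 1e-9)

def inclined_nms(boxes, scores, thresh=0.3):
    # Greedy NMS: keep the highest-scoring box, discard boxes that
    # overlap a kept box by more than `thresh` in rotated IoU.
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if inclined_iou(boxes[i], boxes[j]) < thresh]
    return keep
```

The only difference from standard NMS is the overlap measure: because candidates are rotated rectangles, the intersection area must be computed on the polygons themselves rather than on axis-aligned extents.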
R²CNN: Rotational Region CNN for Orientation Robust Scene Text Detection
Yingying Jiang, Xiangyu Zhu, Xiaobing Wang, Shuli Yang, Wei Li, Hua Wang, Pei Fu and Zhenbo Luo
Samsung R&D Institute China - Beijing
{yy.jiang, xiangyu.zhu, x0106.wang, shuli.yang, wei2016.li, hua00.wang, pei.fu, zb.luo}@samsung.com
arXiv:1706.09579v2 [cs.CV] 30 Jun 2017