IncepText: A New Inception-Text Module with Deformable PSROI Pooling for
Multi-Oriented Scene Text Detection
Qiangpeng Yang, Mengli Cheng, Wenmeng Zhou, Yan Chen, Minghui Qiu, Wei Lin, Wei Chu
Alibaba Group
{qiangpeng.yqp, mengli.cml, wenmeng.zwm, chenyan.cy, minghui.qmh, weilin.lw, weichu.cw}@alibaba-inc.com
Abstract
Incidental scene text detection, especially for multi-oriented text
regions, is one of the most challenging tasks in many computer vision
applications. Unlike objects in common object detection, scene text
often exhibits large variance in aspect ratio, scale, and orientation.
To solve this problem, we propose a novel end-to-end scene text
detector, IncepText, from an instance-aware segmentation perspective.
We design a novel Inception-Text module and introduce deformable PSROI
pooling to deal with multi-oriented text detection. Extensive
experiments on the ICDAR2015, RCTW-17, and MSRA-TD500 datasets
demonstrate our method’s superiority in terms of both effectiveness
and efficiency. Our proposed method achieves first place on the
ICDAR2015 challenge and state-of-the-art performance on the other
datasets. Moreover, we have released our implementation as an OCR
product that is available for public access.
1
1 Introduction
Scene text detection is one of the most challenging tasks in many
computer vision applications such as multilingual translation, image
retrieval, and autonomous driving. The first challenge is that scene
text appears in various kinds of images, such as street views,
posters, menus, and indoor scenes. Furthermore, scene text varies
widely in both foreground text and background objects, as well as in
lighting, blurring, and orientation.
In the past years, there have been many outstanding approaches
focusing on scene text detection. The key point of text detection is
to design features that distinguish text from non-text regions. Most
traditional methods, such as MSER [Neumann and Matas, 2010] and
FASText [Busta et al., 2015], use manually designed text features.
These methods are not robust enough to handle complex scene text.
Recently, Convolutional Neural Network (CNN) based methods have
achieved state-of-the-art results in text detection and recognition
[He et al., 2016b; Tian et al., 2016; Zhou et al., 2017; He et al.,
2017]. CNN based models have a powerful capability for feature
representation, and deeper CNN models are able to extract
higher-level, more abstract features.
1 https://market.aliyun.com/products/57124001/cmapi020020.html
In the literature, there are mainly two types of approaches for scene
text detection, namely indirect and direct regression. Indirect
regression methods, such as CTPN [Tian et al., 2016] and RRPN [Ma et
al., 2017], predict offsets from box proposals. These methods are
based on the Faster-RCNN [Ren et al., 2015] framework. Recently,
direct regression methods, e.g. EAST [Zhou et al., 2017] and DDR [He
et al., 2017], have achieved high performance for scene text
detection. Direct regression usually performs boundary regression by
predicting the offsets from a given point.
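The difference between the two families can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the decoding used by any of the cited detectors: the function names `decode_indirect` and `decode_direct` are hypothetical, the indirect case assumes the usual Faster-RCNN style box parameterization, and the direct case assumes an axis-aligned box described by four distances from a pixel. Real multi-oriented detectors additionally regress an angle or quadrilateral corners.

```python
import math


def decode_indirect(proposal, deltas):
    """Refine a proposal box with predicted offsets (indirect regression).

    proposal: (cx, cy, w, h) centre and size of a box proposal.
    deltas:   (dx, dy, dw, dh) offsets relative to that proposal
              (assumed Faster-RCNN style parameterization).
    Returns the refined box as (x_min, y_min, x_max, y_max).
    """
    cx, cy, w, h = proposal
    dx, dy, dw, dh = deltas
    new_cx = cx + dx * w          # shift the centre by a fraction of the size
    new_cy = cy + dy * h
    new_w = w * math.exp(dw)      # rescale width and height
    new_h = h * math.exp(dh)
    return (new_cx - new_w / 2, new_cy - new_h / 2,
            new_cx + new_w / 2, new_cy + new_h / 2)


def decode_direct(point, distances):
    """Build a box from a single point (direct regression).

    point:     (x, y) location of a pixel inside the text region.
    distances: (top, right, bottom, left) predicted distances from the
               point to the four box boundaries (simplified, no rotation).
    Returns the box as (x_min, y_min, x_max, y_max).
    """
    x, y = point
    top, right, bottom, left = distances
    return (x - left, y - top, x + right, y + bottom)
```

In the indirect case the quality of the result depends on how well the proposal already matches the text, whereas the direct case needs no proposal at all, only a per-pixel prediction.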
In this paper, we solve this problem from an instance-aware
segmentation perspective that mainly draws on the experience of FCIS
[Li et al., 2016]. Unlike objects in common object detection, scene
text often exhibits large variance in scale, aspect ratio, and
orientation. Therefore, we design a novel Inception-Text module to
deal with these challenges. This module is inspired by the Inception
module [Szegedy et al., 2015] in GoogLeNet: we use multiple branches
with different convolution kernels to deal with text of different
aspect ratios and scales. At the end of each branch, we add a
deformable convolution layer to adapt to multiple orientations.
Another improvement is that we replace the PSROI pooling in FCIS with
deformable PSROI pooling [Dai et al., 2017a]. According to our
experiments, deformable PSROI pooling has better performance in the
classification task.
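The multi-branch idea can be illustrated with a toy sketch in plain Python. It is not the Inception-Text module itself: the branch shapes in `BRANCH_SHAPES` are hypothetical, fixed averaging kernels stand in for learned convolution weights, and the deformable convolution layer at the end of each branch is omitted. The sketch only shows how kernels of different shapes applied to the same input respond differently to strokes of different aspect ratios before their outputs are stacked as channels.

```python
def conv2d_same(image, kernel):
    """Zero-padded 2-D correlation of a single-channel image (list of rows)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - ph, x + j - pw
                    if 0 <= yy < h and 0 <= xx < w:   # zero padding outside
                        acc += image[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out


def box_kernel(kh, kw):
    """Averaging kernel of shape kh x kw (stand-in for learned weights)."""
    return [[1.0 / (kh * kw)] * kw for _ in range(kh)]


# Hypothetical branch shapes (height, width): square, wide, and tall kernels.
BRANCH_SHAPES = [(1, 1), (3, 3), (1, 5), (5, 1)]


def multi_branch(image, shapes=BRANCH_SHAPES):
    """Apply every branch to the same input and stack the outputs as channels."""
    return [conv2d_same(image, box_kernel(kh, kw)) for kh, kw in shapes]


# A 5x5 image containing one horizontal stroke in the middle row.
stroke = [[1.0 if row == 2 else 0.0 for _ in range(5)] for row in range(5)]
channels = multi_branch(stroke)
centre_responses = [channel[2][2] for channel in channels]
```

At the centre pixel of the horizontal stroke, the wide 1x5 branch keeps the full response while the tall 5x1 branch drops to about one fifth, which is the kind of shape selectivity that motivates using several kernel shapes in parallel.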
Our main contributions can be summarized as follows:
• We propose a new Inception-Text module for multi-oriented scene
text detection. According to our experiments, this module
significantly increases accuracy at little computational cost.
• We propose to use a deformable PSROI pooling module to deal with
multi-oriented text. The qualitative study of the learned offset
parts in deformable PSROI pooling and the quantitative evaluations
show its effectiveness in handling arbitrarily oriented scene text.
• We evaluate our proposed method on three public datasets,
ICDAR2015, RCTW-17, and MSRA-TD500, and show that it achieves
state-of-the-art performance on several benchmarks without using
any extra data.
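To make the role of the learned offsets in deformable PSROI pooling more tangible, the following is a minimal plain-Python sketch that follows our reading of the formulation in [Dai et al., 2017a]; it is not the implementation evaluated in this paper. The names `bilinear` and `deformable_psroi_pool`, the number of samples per bin, and the default scaling factor `gamma` are assumptions for illustration, and the offsets are passed in by hand rather than predicted by a network. Each of the k x k bins pools from its own score map, and its sampling window is shifted by an offset scaled with the RoI size.

```python
import math


def bilinear(fmap, x, y):
    """Bilinear interpolation of fmap (list of rows) at (x, y); zero outside."""
    h, w = len(fmap), len(fmap[0])
    if x < 0 or y < 0 or x > w - 1 or y > h - 1:
        return 0.0
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = fmap[y0][x0] * (1 - ax) + fmap[y0][x1] * ax
    bottom = fmap[y1][x0] * (1 - ax) + fmap[y1][x1] * ax
    return top * (1 - ay) + bottom * ay


def deformable_psroi_pool(score_maps, roi, offsets, k=3, samples=2, gamma=0.1):
    """Pool a k x k grid of bins, each from its own position-sensitive map.

    score_maps: list of k*k maps; bin (i, j) reads map i*k + j.
    roi:        (x, y, w, h) top-left corner and size in feature-map pixels.
    offsets:    k x k grid of normalized (dx, dy) per bin; all zeros
                reduces to ordinary PSROI pooling.
    gamma:      assumed scaling factor applied to the normalized offsets.
    """
    rx, ry, rw, rh = roi
    bin_w, bin_h = rw / k, rh / k
    pooled = [[0.0] * k for _ in range(k)]
    for i in range(k):              # bin row
        for j in range(k):          # bin column
            dx, dy = offsets[i][j]
            shift_x = gamma * dx * rw   # offset scaled by the RoI width
            shift_y = gamma * dy * rh   # offset scaled by the RoI height
            fmap = score_maps[i * k + j]
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    px = rx + j * bin_w + (sx + 0.5) * bin_w / samples + shift_x
                    py = ry + i * bin_h + (sy + 0.5) * bin_h / samples + shift_y
                    total += bilinear(fmap, px, py)
            pooled[i][j] = total / (samples * samples)
    return pooled
```

Because every bin carries its own offset, the bins are free to move away from the regular grid and follow the extent of a rotated or curved text instance instead of averaging over background.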
arXiv:1805.01167v2 [cs.CV] 8 May 2018