TABLE 4
An overview of some popular scene text detection datasets.

Dataset | Year | Description | #Cites
ICDAR [71] | 2003 | ICDAR2003 is one of the first public datasets for text detection. ICDAR 2015 and 2017 are other popular iterations of the ICDAR challenge [72, 73]. url: http://rrc.cvc.uab.es/ | 530
SVT [74] | 2010 | Consists of ∼350 images and ∼720 text instances taken from Google Street View. url: http://tc11.cvc.uab.es/datasets/SVT_1 | 339
MSRA-TD500 [75] | 2012 | Consists of ∼500 indoor/outdoor images with Chinese and English text. url: http://www.iapr-tc11.org/mediawiki/index.php/MSRA_Text_Detection_500_Database_(MSRA-TD500) | 413
IIIT5K [76] | 2012 | Consists of ∼1,100 images and ∼5,000 words from both street scenes and born-digital images. url: http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html | 165
Syn90k [77] | 2014 | A synthetic dataset with 9 million images generated from a 90,000-word vocabulary in multiple fonts. url: http://www.robots.ox.ac.uk/~vgg/data/text/ | 246
COCO-Text [78] | 2016 | The largest text detection dataset so far. Built on MS-COCO, it consists of ∼63,000 images and ∼173,000 text annotations. url: https://bgshih.github.io/cocotext/ | 69
TABLE 5
An overview of some popular traffic light detection and traffic sign detection datasets.

Dataset | Year | Description | #Cites
TLR [79] | 2009 | Captured by a moving vehicle in Paris. Consists of ∼11,000 video frames and ∼9,200 traffic light instances. url: http://www.lara.prd.fr/benchmarks/trafficlightsrecognition | 164
LISA [80] | 2012 | One of the first traffic sign detection datasets. Consists of ∼6,600 video frames and ∼7,800 instances of 47 US signs. url: http://cvrr.ucsd.edu/LISA/lisa-traffic-sign-dataset.html | 325
GTSDB [81] | 2013 | One of the most popular traffic sign detection datasets. Consists of ∼900 images with ∼1,200 traffic signs captured under various weather conditions at different times of day. url: http://benchmark.ini.rub.de/?section=gtsdb&subsection=news | 259
BelgianTSD [82] | 2012 | Consists of ∼7,300 static images, ∼120,000 video frames, and ∼11,000 traffic sign annotations of 269 types. The 3D location of each sign is also annotated. url: https://btsd.ethz.ch/shareddata/ | 224
TT100K [83] | 2016 | The largest traffic sign detection dataset so far, with ∼100,000 images (2048 × 2048) and ∼30,000 traffic sign instances of 128 classes. Each instance is annotated with a class label, bounding box, and pixel mask. url: http://cg.cs.tsinghua.edu.cn/traffic-sign/ | 111
BSTL [84] | 2017 | The largest traffic light detection dataset. Consists of ∼5,000 static images, ∼8,300 video frames, and ∼24,000 traffic light instances. url: https://hci.iwr.uni-heidelberg.de/node/6132 | 21
tion problems. Therefore, machine learning based detection
methods were beginning to prosper.
Machine learning based detection has gone through multiple periods, including statistical models of appearance (before 1998), wavelet feature representations (1998-2005), and gradient-based representations (2005-2012).
Building statistical models of an object, like Eigenfaces [95, 106] as shown in Fig. 5 (a), was the first wave of learning-based approaches in object detection history. In 1991, M. Turk et al. achieved real-time face detection in a lab environment by using Eigenface decomposition [95]. Compared with the rule-based or template-based approaches of its time [107, 108], a statistical model provides a more holistic description of an object's appearance by learning task-specific knowledge from data.
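To make the idea of an appearance subspace concrete, the following minimal sketch builds a small set of eigenfaces with plain NumPy and scores a candidate window by its reconstruction error; the 64×64 crop size, the number of components, and the random stand-in data are illustrative assumptions, not the setup of [95].

```python
import numpy as np

# Minimal Eigenface sketch: PCA over vectorized face crops.
# ASSUMPTION: 64x64 crops and random data stand in for an aligned face dataset.
rng = np.random.default_rng(0)
faces = rng.random((200, 64 * 64))           # 200 vectorized face crops

mean_face = faces.mean(axis=0)               # average face
centered = faces - mean_face                 # remove the mean appearance

# Principal components ("eigenfaces") via SVD of the centered data.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
eigenfaces = vt[:16]                         # keep the top-16 components

def reconstruction_error(window_vec):
    """Distance of a window to the eigenface subspace.
    A small residual suggests a face-like appearance."""
    coeffs = eigenfaces @ (window_vec - mean_face)
    recon = mean_face + eigenfaces.T @ coeffs
    return np.linalg.norm(window_vec - recon)

print(reconstruction_error(faces[0]))
```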
Wavelet feature transforms began to dominate visual recognition and object detection after 2000. The essence of this group of methods is learning by transforming an image from pixels into a set of wavelet coefficients. Among these methods, the Haar wavelet, owing to its high computational efficiency, has been used most widely in object detection tasks such as general object detection [29], face detection [10, 11, 109], and pedestrian detection [30, 31]. Fig. 5 (d) shows a set of Haar wavelet basis functions learned by the VJ detector [10, 11] for human faces.
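As a concrete illustration of why Haar-like features are so cheap to evaluate, the sketch below computes a two-rectangle feature from an integral image with plain NumPy, in the spirit of the VJ detector; the 24×24 window and the specific rectangle layout are illustrative assumptions rather than the basis actually learned in [10, 11].

```python
import numpy as np

def integral_image(img):
    """Cumulative sums (with a zero border) so any rectangle sum costs four lookups."""
    ii = img.cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, top, left, height, width):
    """Sum of pixels in the given rectangle via the integral image."""
    b, r = top + height, left + width
    return ii[b, r] - ii[top, r] - ii[b, left] + ii[top, left]

def two_rect_haar(ii, top, left, height, width):
    """Vertical two-rectangle feature: left half minus right half."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))

# ASSUMPTION: a random 24x24 patch stands in for a real detection window.
window = np.random.default_rng(0).random((24, 24))
ii = integral_image(window)
print(two_rect_haar(ii, top=4, left=4, height=16, width=16))
```

Once the integral image is built, every such feature costs a constant number of array lookups regardless of its size, which is what made exhaustive sliding-window evaluation feasible.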
• Early CNNs for object detection
The history of using CNNs to detect objects can be