Deep Learning for Generic Object Detection: A Survey 5
achieving two competing goals: high quality/accuracy and high
efficiency, as illustrated in Fig. 4. As Fig. 5 shows, high-quality
detection must accurately localize and recognize objects in images or
video frames, so that the large variety of object categories in the
real world can be distinguished (i.e., high distinctiveness), and so
that object instances from the same category, although subject to
intraclass appearance variations, can still be localized and
recognized (i.e., high robustness). High efficiency requires the
entire detection task to run at a sufficiently high frame rate with
acceptable memory and storage usage. Despite several decades of
research and significant progress, the combined goals of accuracy and
efficiency have arguably not yet been met.
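The frame-rate requirement can be made concrete with a small measurement sketch. The code below is an illustration only, not a procedure from the survey: `measure_fps`, `latency_budget_ms`, the `warmup` parameter, and the 30 frames-per-second target are all assumptions chosen for the example, and `detector` stands for any callable that processes one frame.

```python
import time


def latency_budget_ms(target_fps):
    """Per-frame time budget (in milliseconds) implied by a target frame rate."""
    return 1000.0 / target_fps


def measure_fps(detector, frames, warmup=2):
    """Average frames per second achieved by `detector` over `frames`.

    `detector` is a hypothetical callable taking a single frame.
    The first `warmup` frames are run once beforehand and not timed.
    """
    for frame in frames[:warmup]:
        detector(frame)
    start = time.perf_counter()
    for frame in frames:
        detector(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


def meets_frame_rate(fps, target_fps=30.0):
    """True if the measured throughput reaches the (assumed) target."""
    return fps >= target_fps


if __name__ == "__main__":
    # Stand-in detector that does no real work; replace with an actual model.
    dummy_detector = lambda frame: None
    fps = measure_fps(dummy_detector, list(range(1000)))
    print(f"budget at 30 fps: {latency_budget_ms(30.0):.1f} ms per frame")
    print(f"measured: {fps:.0f} fps, meets target: {meets_frame_rate(fps)}")
```

A throughput check of this kind covers only the speed side of efficiency; memory and storage usage would have to be measured separately.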
2.2.1 Accuracy related challenges
For accuracy, the challenge stems from 1) the vast range of
intraclass variations and 2) the huge number of object categories.
We begin with intraclass variations, which can be divided into two
types: intrinsic factors and imaging conditions. For the former, each
object category can have many different object instances, possibly
varying in one or more of color, texture, material, shape, and size,
such as the “chair” category shown in Fig. 5 (h). Even in a more
narrowly defined class, such as human or horse, object instances can
appear in different poses, with nonrigid deformations and different
clothing.
For the latter, the variations are caused by changes in imaging
conditions and unconstrained environments, which may have dramatic
impacts on object appearance. In particular, different instances, or
even the same instance, can be captured under a wide range of
differing conditions: different times, locations, weather conditions,
cameras, backgrounds, illuminations, viewpoints, and viewing
distances. All of these conditions produce significant variations in
object appearance, such as illumination, pose, scale, occlusion,
background clutter, shading, blur and motion, with examples
illustrated in Fig. 5 (a-g). Further challenges may be added by
digitization artifacts, noise corruption, poor resolution, and
filtering distortions.
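To make the effect of imaging conditions tangible, the following minimal sketch perturbs one toy image in three of the ways listed above (illumination change, noise corruption, and poor resolution) and measures how far the pixels move. It is an illustration under stated assumptions, not a method from the survey: the 8x8 toy image, the function names, and the gain, noise, and downsampling values are all hypothetical.

```python
import random


def make_toy_image(size=8):
    """Dark background (value 40) containing a bright square object (value 200)."""
    return [[200 if 2 <= r < 6 and 2 <= c < 6 else 40 for c in range(size)]
            for r in range(size)]


def clip(value):
    """Round to an integer and clip to the 8-bit range [0, 255]."""
    return min(255, max(0, round(value)))


def adjust_illumination(image, gain):
    """Simulate a global illumination change by scaling every pixel."""
    return [[clip(p * gain) for p in row] for row in image]


def add_noise(image, sigma, seed=0):
    """Simulate sensor noise with zero-mean Gaussian perturbations."""
    rng = random.Random(seed)
    return [[clip(p + rng.gauss(0, sigma)) for p in row] for row in image]


def downsample(image, factor):
    """Simulate poor resolution by averaging factor x factor pixel blocks."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(0, h - h % factor, factor):
        row = []
        for c in range(0, w - w % factor, factor):
            block = [image[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            row.append(round(sum(block) / len(block)))
        out.append(row)
    return out


def mean_abs_diff(a, b):
    """Mean absolute pixel difference between two images of equal size."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    img = make_toy_image()
    darker = adjust_illumination(img, 0.5)
    noisy = add_noise(img, sigma=10.0)
    print("illumination shift:", mean_abs_diff(img, darker))
    print("noise shift:", mean_abs_diff(img, noisy))
    print("low-resolution shape:", len(downsample(img, 2)), "x",
          len(downsample(img, 2)[0]))
```

The object is the same in every variant, yet the raw pixel values differ substantially, which is the sense in which a robust detector has to be insensitive to imaging conditions.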
In addition to intraclass variations, the large number of object
categories, on the order of $10^4$ to $10^5$, demands great
discriminative power from the detector to distinguish between subtly
different interclass variations, as illustrated in Fig. 5 (i). In
practice, current detectors focus mainly on structured object
categories, such as the 20, 200 and 91 object classes in PASCAL VOC
[53], ILSVRC [179] and MS COCO [129], respectively. Clearly, the
number of object categories under consideration in existing benchmark
datasets is much smaller than the number that humans can recognize.
2.2.2 Efficiency related challenges
The exponentially increasing number of images calls for efficient
and scalable detectors. The prevalence of social media networks and
mobile/wearable devices has led to increasing demands for analyzing
visual data. However, mobile/wearable devices have limited
computational capabilities and storage space, which makes an
efficient object detector critical.
For efficiency, the challenges stem from the need to localize and
recognize all object instances of a very large number of object
categories, and the very large number of possible locations and