The three operations that are repeatedly applied by a typical CNN are illustrated in Fig. 8 (a). CNNs having a large number of layers, i.e., "deep" networks, are referred to as Deep CNNs (DCNNs), and a typical DCNN architecture is illustrated in Fig. 8 (b).
As can be seen from Fig. 8 (b), each layer of a CNN consists of a number of feature maps, within which each pixel acts like a neuron. Each neuron in a convolutional layer is connected to the feature maps of the previous layer through a set of weights (essentially a filter). The early layers of a CNN are typically composed of convolutional and pooling layers, while the later layers are normally fully connected. A nonlinearity (such as a ReLU activation) is normally applied between each pair of layers.
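Although the survey itself contains no code, the layer pattern just described can be made concrete. The following is a minimal PyTorch sketch (the framework choice and all layer sizes are illustrative assumptions, not taken from the survey): convolutional and pooling layers early, a nonlinearity between layers, and a fully connected layer at the end.

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    # Minimal sketch of the typical DCNN layout described above;
    # channel counts and kernel sizes are illustrative, not from the survey.
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution: learned filters over feature maps
            nn.ReLU(),                                   # nonlinearity between layers
            nn.MaxPool2d(2),                             # pooling: downsampling enlarges the receptive field
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # later layers: fully connected

    def forward(self, x):
        x = self.features(x)                  # e.g., 3x32x32 input -> 32x8x8 feature maps
        return self.classifier(x.flatten(1))

logits = TinyCNN()(torch.randn(1, 3, 32, 32))  # one 32x32 RGB image -> class scores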
From earlier to later layers, the input image repeatedly undergoes convolution, and with each layer the receptive field (the region of support) increases. In general, the initial CNN layers extract low-level features (e.g., edges), with later layers extracting features of increasing complexity [296, 13, 145, 195].
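The rate at which the receptive field grows can be computed with a standard layer-by-layer recurrence (a well-known property of stacked convolutions, not a formula given in the survey): each layer with kernel size k enlarges the receptive field by (k - 1) times the product of the strides of all preceding layers. A minimal sketch:

def receptive_field(layers):
    # layers: sequence of (kernel_size, stride) pairs, earliest layer first.
    # Standard recurrence: r += (k - 1) * jump, then jump *= stride.
    r, jump = 1, 1
    for kernel, stride in layers:
        r += (kernel - 1) * jump
        jump *= stride
    return r

# Two 3x3 convolutions (stride 1) followed by a 2x2 pooling (stride 2)
# give each output neuron a receptive field of 6x6 input pixels.
print(receptive_field([(3, 1), (3, 1), (2, 2)]))  # -> 6

This is why neurons in later layers, which see ever larger regions of the input, can respond to increasingly complex structures.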
DCNNs have a number of outstanding advantages: a hierarchical structure to learn representations of data with multiple levels of abstraction, the capacity to learn very complex functions, and the ability to learn feature representations directly and automatically from data with minimal domain knowledge. What has particularly made DCNNs feasible has been the availability of large-scale labeled datasets and of GPUs with very high computational capability.
Despite these great successes, known deficiencies remain. In particular, there is an extreme need for labeled training data and a requirement for expensive computing resources, and considerable skill and experience are still needed to select appropriate learning parameters and network architectures. Trained networks are poorly interpretable, lack robustness to image transformations and degradations, and many DCNNs have shown serious vulnerability to attacks, all of which currently limit the use of DCNNs in many real-world applications.
4 Datasets and Performance Evaluation
4.1 Datasets
Datasets have played a key role throughout the history of object recognition research, not only as a common ground for measuring and comparing the performance of competing algorithms, but also in pushing the field towards increasingly complex and challenging problems. In particular, with deep learning techniques recently revolutionizing many visual recognition problems, large amounts of annotated data have played a key role in their success. Present access to large numbers of images on the Internet makes it possible to build comprehensive datasets with increasing numbers of images and categories, in order to capture an ever greater richness and diversity of objects, enabling unprecedented performance in object recognition.
For generic object detection, there are four famous datasets: PASCAL VOC [66, 67], ImageNet [52], MS COCO [162] and Open Images [139]. Attributes of these datasets are summarized in Table 3, and selected sample images are shown in Fig. 9. There are three steps to creating a large-scale annotated dataset: determining the set of target object categories, collecting a diverse set of candidate images to represent the selected categories on the Internet, and annotating the large number of collected images, typically by designing crowdsourcing strategies (the most challenging step). Recognizing space limitations, we refer interested readers to the original papers [66, 67, 162, 230, 139] for detailed descriptions of these datasets in terms of construction and properties.

Table 2 Most frequent object classes for each detection challenge. The size of each word is proportional to the frequency of that class in the training dataset. (a) PASCAL VOC (20 classes); (b) MS COCO (80 classes); (c) ILSVRC (200 classes); (d) Open Images Detection Challenge (500 classes).
The four datasets form the backbone of their respective detection challenges. Each challenge consists of a publicly available dataset of images together with ground truth annotation and standardized evaluation software, and an annual competition and corresponding workshop. Statistics for the number of images and object instances in the training, validation and testing datasets for the detection challenges are given in Table 4; the annotations on the test sets are not publicly released, except for PASCAL VOC2007. The most frequent object classes in the VOC, COCO, ILSVRC and Open Images detection datasets are visualized in Table 2.
PASCAL VOC [66, 67] is a multiyear effort devoted to the creation and maintenance of a series of benchmark datasets for classification and object detection, creating the precedent for standardized evaluation of recognition algorithms in the form of annual competitions. Starting from only four categories in 2005, the dataset has increased to 20 categories that are common in everyday life, as shown in Fig. 9.