2.1.4 Cascades
Traditional one-class object detection pipelines resorted to boosting-like
approaches to improve performance, where uncorrelated weak classifiers
(better than random chance but not too correlated with the true predictions)
are combined to form a strong classifier. With modern CNNs, as the classifiers
are quite strong, the attractiveness of those methods has plummeted. However,
for some specific problems where there are still too many false positives,
researchers still find them useful. Furthermore, if the weak CNNs used are very
shallow, cascading can also sometimes increase the overall speed of the method.
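As an illustration of how weak classifiers are combined into a strong one, here is a minimal sketch of an AdaBoost-style weighted vote, one common instance of the boosting-like approaches mentioned above. The decision stumps, the error rates and the function names are hypothetical and chosen only for the example; they do not come from any of the cited works.

```python
import math


def adaboost_alpha(error):
    """Vote weight for a weak classifier with weighted error in (0, 0.5)."""
    return 0.5 * math.log((1.0 - error) / error)


def strong_classify(weak_classifiers, alphas, x):
    """Sign of the alpha-weighted vote; each weak classifier returns +1 or -1."""
    score = sum(a * h(x) for h, a in zip(weak_classifiers, alphas))
    return 1 if score >= 0 else -1


# Three hypothetical decision stumps on a 2-D feature vector (assumption).
stumps = [
    lambda x: 1 if x[0] > 0.5 else -1,
    lambda x: 1 if x[1] > 0.3 else -1,
    lambda x: 1 if x[0] + x[1] > 1.0 else -1,
]
errors = [0.30, 0.40, 0.35]  # illustrative weighted error rates (assumption)
alphas = [adaboost_alpha(e) for e in errors]

print(strong_classify(stumps, alphas, (0.9, 0.2)))
print(strong_classify(stumps, alphas, (0.1, 0.1)))
```

The lower a weak classifier's error, the larger its vote, so a single disagreeing stump can be overruled by two more reliable ones.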
One of the first ideas to be developed was to cascade multiple CNNs.
Li et al. [2015] and Yang and Nevatia [2016] both used a three-stage approach
that chained three CNNs for face detection. The former approach scanned the
image using a 12 × 12 patch CNN to reject 90% of the non-face regions in a
coarse manner. The remaining detections were offset by a second CNN and given
as input to a 24 × 24 CNN that continued rejecting false positives and refining
regressions. The final candidates were then passed on to a 48 × 48
classification network, which output the final score. The latter approach
created separate score maps for different resolutions using the same FCN on
different scales of the test image (image pyramid). These score maps were then
up-sampled to the same resolution and added to create a final score map, which
was then used to select proposals. Proposals were then passed to the second
stage, where two different verification CNNs, trained on hard examples,
eradicated the remaining false positives. The first was a four-layer FCN
trained from scratch and the second an AlexNet [Krizhevsky et al., 2012]
pre-trained on ImageNet.
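The control flow of such a hard cascade can be sketched as follows. This is only a toy illustration: the integer "windows", the scoring function and the thresholds are hypothetical stand-ins for the 12 × 12, 24 × 24 and 48 × 48 networks, and the first threshold is chosen so that 90% of the windows are rejected at the first stage.

```python
def run_hard_cascade(windows, stages):
    """Pass windows through (score_fn, threshold) stages, dropping rejects early.

    Returns the surviving windows and the number of windows each stage scored.
    """
    survivors = list(windows)
    evaluated = []
    for score_fn, threshold in stages:
        evaluated.append(len(survivors))
        survivors = [w for w in survivors if score_fn(w) >= threshold]
    return survivors, evaluated


# Toy data (assumption): each window is an integer "faceness" from 0 to 99.
windows = list(range(100))


def toy_score(window):
    """Stand-in for a CNN score; a real stage would run a network on the patch."""
    return window


stages = [
    (toy_score, 90),  # cheap first stage: rejects 90% of the windows
    (toy_score, 95),  # second stage: sees only the survivors
    (toy_score, 98),  # final stage: most expensive, fewest inputs
]

faces, evaluated = run_hard_cascade(windows, stages)
print(faces)
print(evaluated)
```

The `evaluated` list makes the speed argument concrete: the later, more expensive stages only score the few windows that the earlier stages let through.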
The cascades described above are ad hoc: the CNNs are independent of each
other and there is no overall design. They could therefore benefit from
integrating the elegant zooming module that is RoI-Pooling.
RoI-Pooling can act like a glue to pass the detections from one network to
the other, while doing the down-sampling operation locally. Dai et al. [2016a]
used a Mask R-CNN-like structure that first proposed bounding boxes, then
predicted a mask and used a third stage to perform fine-grained discrimination
on masked regions that are RoI-Pooled a second time.
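To make the "local down-sampling" role of RoI-Pooling concrete, here is a minimal pure-Python sketch of RoI max pooling on a single-channel feature map. It assumes integer, end-exclusive region coordinates and a simple floor/ceil binning rule; real implementations differ in how they quantize coordinates, and the function names are illustrative only.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (y0, x0, y1, x1), end-exclusive, to out_h x out_w."""
    y0, x0, y1, x1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Floor for the bin start and ceil for the bin end keep every bin non-empty.
        ys, ye = y0 + (i * h) // out_h, y0 + ceil_div((i + 1) * h, out_h)
        row = []
        for j in range(out_w):
            xs, xe = x0 + (j * w) // out_w, x0 + ceil_div((j + 1) * w, out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# 4 x 4 toy feature map with values 0..15 in row-major order (assumption).
fmap = [[4 * r + c for c in range(4)] for r in range(4)]

print(roi_max_pool(fmap, (0, 0, 4, 4), 2, 2))
print(roi_max_pool(fmap, (1, 1, 4, 4), 2, 2))
```

Whatever the size of the region, the output grid has a fixed shape, which is what lets one network hand its detections to the next.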
Ouyang et al. [2017] and Wang et al. [2017a] optimized in an end-to-end manner
a Faster R-CNN with multiple stages of RoI-Pooling. Each stage accepted only
the highest scored proposals from the previous stage and added more context
and/or localized the detection better. In Ouyang et al. [2017], for example,
the additional information about context was then used for fine-grained
discrimination between hard negatives and true positives. In contrast,
Zhang et al. [2016a] showed that, for pedestrian detection, RoI-Pooling on too
coarse a feature map actually hurts the result. This problem has been
alleviated by the use of feature pyramid networks with higher resolution
feature maps. Zhang et al. therefore used the RPN proposals of a Faster R-CNN
in a boosting pipeline involving a forest (Tang et al. [2017c] acted similarly
for small vehicle detection).
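The stage-wise selection described above for the multi-stage Faster R-CNN variants, where each stage accepts only the highest scored proposals of the previous one, can be sketched as a rank-and-rescore loop. The proposal format, the `keep_fraction` parameter and the scorer below are assumptions made for illustration; in the cited works the rescoring is done by learned network stages.

```python
def refine_in_stages(proposals, stage_scorers, keep_fraction=0.5):
    """Each stage keeps the top-scored proposals of the previous stage, then rescores them."""
    current = list(proposals)
    for scorer in stage_scorers:
        ranked = sorted(current, key=lambda p: p["score"], reverse=True)
        keep = max(1, int(len(ranked) * keep_fraction))
        current = [{"box": p["box"], "score": scorer(p)} for p in ranked[:keep]]
    return current


def width_aware_scorer(proposal):
    """Hypothetical stage: halves the score of boxes narrower than 10 pixels."""
    x0, y0, x1, y1 = proposal["box"]
    return proposal["score"] * (1.0 if (x1 - x0) >= 10 else 0.5)


# Toy proposals as (x0, y0, x1, y1) boxes with initial scores (assumption).
proposals = [
    {"box": (0, 0, 10, 10), "score": 0.9},
    {"box": (5, 5, 20, 20), "score": 0.8},
    {"box": (1, 1, 4, 4), "score": 0.3},
    {"box": (2, 2, 30, 30), "score": 0.6},
]

final = refine_in_stages(proposals, [width_aware_scorer, width_aware_scorer])
print(final)
```

Unlike a fixed threshold, ranking bounds the number of proposals each later stage has to process.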
Yang et al. [2016a], aware of the problem raised by Zhang et al. [2016a], used
RoI-Pooling on multiple scaled feature maps of all the layers of the network.
The classification function on each layer was learned using the weak classifiers
of AdaBoost and then approximated using a fully connected neural network.
While all the mentioned pipelines are hard cascades where the different
classifiers are independent, it is sometimes possible to use a soft cascade where the