[Fig. 4 bar plots: (a) "Number of Anchors for Bg and Fg" — background ≈ 166827.50 vs. foreground ≈ 163.04 anchors; (b) "Number of Anchors for Fg Classes" — per-class anchor counts over the 80 foreground classes.]
Fig. 4: Illustration of the class imbalance problems. The numbers of RetinaNet [22] anchors on MS-COCO [90] are plotted for the foreground-background classes (a) and the foreground classes (b). The values are normalized by the total number of images in the dataset. The figures depict severe imbalance towards some classes.
Solutions. We can group the solutions for foreground-background class imbalance into four categories: (i) hard sampling methods, (ii) soft sampling methods, (iii) sampling-free methods and (iv) generative methods. Each set of methods is explained in detail in the subsections below.
In sampling methods, the contribution ($w_i$) of a bounding box ($BB_i$) to the loss function is adjusted as follows:

$$w_i \, \mathrm{CE}(p_s), \quad (2)$$

where $\mathrm{CE}(\cdot)$ is the cross-entropy loss. Hard and soft sampling approaches differ in the possible values of $w_i$. For hard sampling approaches, $w_i \in \{0, 1\}$, thus a BB is either selected or discarded. For soft sampling approaches, $w_i \in [0, 1]$, i.e. the contribution of a sample is adjusted with a weight and each BB is somehow included in training.
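To make the distinction concrete, the following is a minimal sketch of how Eq. (2) could be applied with hard versus soft weights. The focal-loss-style soft weight and the `selected_idx` set are illustrative assumptions of ours, not the specific choices of any reviewed method:

```python
import numpy as np

def cross_entropy(p_s):
    """CE(p_s) = -log(p_s), where p_s is the predicted
    probability of the ground-truth class."""
    return -np.log(np.clip(p_s, 1e-12, 1.0))

def hard_weights(num_boxes, selected_idx):
    """Hard sampling: w_i is binary -- 1 for selected boxes,
    0 for the rest (they are ignored this iteration)."""
    w = np.zeros(num_boxes)
    w[selected_idx] = 1.0
    return w

def soft_weights(p_s, gamma=2.0):
    """Soft sampling: every box keeps a weight in [0, 1].
    A focal-loss-style weight (1 - p_s)^gamma is used here
    purely as an illustrative choice of soft weighting."""
    return (1.0 - p_s) ** gamma

# Toy example: 5 boxes with predicted ground-truth probabilities.
p_s = np.array([0.9, 0.6, 0.3, 0.8, 0.1])

loss_hard = hard_weights(len(p_s), selected_idx=[1, 2, 4]) * cross_entropy(p_s)
loss_soft = soft_weights(p_s) * cross_entropy(p_s)
print(loss_hard.sum(), loss_soft.sum())
```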
4.1.1 Hard Sampling Methods
Hard sampling is a commonly-used method for addressing imbalance in object detection. It restricts $w_i$ to be binary, i.e., 0 or 1. In other words, it addresses imbalance by selecting a subset of positive and negative examples (with desired quantities) from a given set of labeled BBs. This selection is performed using heuristic methods, and the non-selected examples are ignored for the current iteration. Therefore, each sampled example contributes equally to the loss (i.e. $w_i = 1$) and the non-selected examples ($w_i = 0$) have no contribution to the training for the current iteration. See Table 3 for a summary of the main approaches.
A straightforward hard-sampling method is random sampling. Despite its simplicity, it is employed in the R-CNN family of detectors [16], [21]: for training the RPN, 128 positive anchors are sampled uniformly at random (out of all positive examples) and 128 negative anchors are sampled in a similar fashion; for training the detection network [17], 16 positive and 48 negative RoIs are sampled uniformly at random from each image in the batch, each from within their respective sets. In either case, if the number of positive input bounding boxes is less than the desired value, the mini-batch is padded with randomly sampled negatives. On the other hand, it has been reported that other sampling strategies may perform better when a property of an input box, such as its loss value or IoU, is taken into account [24], [29], [30].
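As a concrete illustration, here is a minimal sketch (ours, not code from [16], [21]) of random RoI sampling with negative padding for a per-image batch of 16 positives and 48 negatives:

```python
import random

def random_sample_rois(pos_rois, neg_rois, num_pos=16, num_neg=48):
    """Randomly sample positive/negative RoIs for one image.
    If there are too few positives, pad the mini-batch with
    extra randomly sampled negatives so its size stays fixed."""
    pos = random.sample(pos_rois, min(num_pos, len(pos_rois)))
    shortfall = num_pos - len(pos)  # positives we could not fill
    neg = random.sample(neg_rois, min(num_neg + shortfall, len(neg_rois)))
    return pos, neg
```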
The first set of approaches to consider a property of the sampled examples, rather than sampling randomly, is the hard-example mining methods⁴. These methods rely on the hypothesis that training a detector more on hard examples (i.e. examples with high losses) leads to better performance. The origins of this hypothesis go back to the bootstrapping idea in early works on face detection [55], [94], [95], human detection [96] and object detection [13]: an initial model is trained on a subset of negative examples, and then a new classifier is trained using the negative examples on which the first classifier fails (i.e. hard examples). Multiple classifiers are obtained by applying the same procedure iteratively. Current deep-learning-based methods also adopt versions of hard-example mining in order to provide more useful examples based on the loss values of the examples. The first deep object detector to use hard examples in training was the Single-Shot Detector [19], which chooses only the negative examples incurring the highest loss values. A more systematic approach, considering the loss values of both positive and negative samples, was proposed in Online Hard Example Mining (OHEM) [24]. However, OHEM needs additional memory and slows down training. Considering the efficiency and memory problems of OHEM, IoU-based sampling [29] was proposed to associate the hardness of the examples with their IoUs and, again, to use a sampling method only for the negative examples rather than computing the loss function over the entire set. In IoU-based sampling, the IoU interval for the negative samples is divided into K bins and an equal number of negative examples is sampled randomly within each bin to promote the samples with higher IoUs, which are expected to have higher loss values.
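A minimal sketch of the K-bin idea follows; it is our interpretation of [29], and the bin boundaries and the `(roi, iou)` input format are assumptions for illustration:

```python
import random

def iou_balanced_sampling(neg_rois, num_samples, K=3, max_iou=0.5):
    """IoU-balanced sampling of negatives: split the negative IoU
    range [0, max_iou) into K equal bins and sample the same number
    of negatives from each bin, which favors high-IoU (likely hard)
    negatives relative to pure random sampling.
    `neg_rois` is a list of (roi, iou) pairs."""
    per_bin = num_samples // K
    sampled = []
    for k in range(K):
        lo = k * max_iou / K
        hi = (k + 1) * max_iou / K
        bin_rois = [r for r, iou in neg_rois if lo <= iou < hi]
        # If a bin has too few candidates, take all of them.
        sampled += random.sample(bin_rois, min(per_bin, len(bin_rois)))
    return sampled
```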
To improve mining performance, several studies proposed to limit the search space in order to make hard examples easier to mine. Two-stage object detectors [18], [21] are among these methods since they aim to find the most probable bounding boxes (i.e. RoIs) given the anchors, and then choose the top N RoIs with the highest objectness scores, to which an additional sampling method is applied. Fast R-CNN [17] sets the lower bound of the IoU of the negative RoIs
4. In this paper, we adopt the boldface font whenever we introduce
an approach involving a set of different methods, and the method
names themselves are in italic.