Figure 2: Representation flows for several typical detection frameworks: (a) Faster R-CNN (Anchor → Proposal → Detection), (b) RetinaNet (Anchor → Detection), (c) FCOS (Object Center → Detection), and (d) CornerNet (Corner Points → Grouping → Detection).
bounding box can be described by a 4-d vector, either as center-size $(x_c, y_c, w, h)$ or as opposing corners $(x_{tl}, y_{tl}, x_{br}, y_{br})$. Besides the final output, this representation is also commonly used as the initial and intermediate object representation, such as anchors [24, 20, 22, 23, 18] and proposals [9, 4, 17, 11]. For bounding box representations, features are usually extracted by pooling operators within the bounding box area on an image feature map. Common pooling operators include RoIPool [8], RoIAlign [11], and Deformable RoIPool [5, 40]. There are also simplified feature extraction methods, e.g., the box center features are usually employed in the anchor box representation [24, 18].
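The two 4-d parameterizations carry the same information and can be converted back and forth. A minimal sketch of the conversion (illustrative helper names, not code from any cited framework):

```python
import numpy as np

def center_size_to_corners(boxes):
    """Convert (x_c, y_c, w, h) boxes to (x_tl, y_tl, x_br, y_br)."""
    xc, yc, w, h = np.split(np.asarray(boxes, dtype=float), 4, axis=-1)
    return np.concatenate([xc - w / 2, yc - h / 2,
                           xc + w / 2, yc + h / 2], axis=-1)

def corners_to_center_size(boxes):
    """Convert (x_tl, y_tl, x_br, y_br) boxes to (x_c, y_c, w, h)."""
    x0, y0, x1, y1 = np.split(np.asarray(boxes, dtype=float), 4, axis=-1)
    return np.concatenate([(x0 + x1) / 2, (y0 + y1) / 2,
                           x1 - x0, y1 - y0], axis=-1)
```

Both functions operate on arbitrary batches of boxes via the trailing axis, so the same code serves anchors, proposals, and final detections.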
Object center representation
The 4-d vector space of a bounding box representation is at a scale of $O(H^2 \times W^2)$ for an image with resolution $H \times W$, which is too large to process fully. To reduce the representation space, some recent frameworks [29, 35, 38, 14, 32] use the center point as a simplified representation. Geometrically, a center point is described by a 2-d vector $(x_c, y_c)$, whose hypothesis space is of the scale $O(H \times W)$, which is much more tractable. For a center point representation, the image feature on the center point is usually employed as the object feature.
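Sampling the object feature at the center point amounts to indexing the feature map at the center's (downsampled) location. A minimal sketch, assuming a $(C, H', W')$ feature map produced at a known stride (function and argument names are illustrative):

```python
import numpy as np

def center_point_features(feature_map, centers, stride):
    """Sample the feature vector at each object's center point.

    feature_map: array of shape (C, H', W') at the given stride.
    centers: array-like of shape (N, 2) holding (x_c, y_c) in image coords.
    Returns an (N, C) array of per-object features.
    """
    cs = (np.asarray(centers, dtype=float) / stride).astype(int)
    # Clamp to the feature map bounds before indexing.
    xs = np.clip(cs[:, 0], 0, feature_map.shape[2] - 1)
    ys = np.clip(cs[:, 1], 0, feature_map.shape[1] - 1)
    return feature_map[:, ys, xs].T  # fancy indexing gathers (C, N)
```

Real systems often replace the integer rounding with bilinear interpolation, but the nearest-cell lookup above captures the core idea.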
Corner representation
A bounding box can be determined by two points, e.g., a top-left corner and a bottom-right corner. Some approaches [30, 15, 16, 7, 21, 39, 26] first detect these individual points and then compose bounding boxes from them. We refer to these representation methods as corner representation. The image feature at the corner location can be employed as the part feature.
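The composition step pairs top-left and bottom-right detections into boxes. A simplified sketch of embedding-based grouping in the spirit of CornerNet (a pair is kept when the bottom-right corner lies below and to the right of the top-left corner and their scalar embeddings are close; names and the threshold are illustrative assumptions):

```python
def compose_boxes(tl_points, br_points, tl_emb, br_emb, max_dist=0.5):
    """Pair detected top-left/bottom-right corners into boxes.

    tl_points, br_points: lists of (x, y) corner locations.
    tl_emb, br_emb: per-corner scalar embeddings learned to match
    within the same object (associative-embedding style).
    """
    boxes = []
    for (x0, y0), e0 in zip(tl_points, tl_emb):
        for (x1, y1), e1 in zip(br_points, br_emb):
            # Geometric validity: bottom-right must be below-right.
            if x1 > x0 and y1 > y0 and abs(e0 - e1) < max_dist:
                boxes.append((x0, y0, x1, y1))
    return boxes
```

This quadratic all-pairs loop is the post-processing step that object-based representations avoid, as discussed next.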
Summary and comparison
Different representation approaches usually have strengths in different aspects. For example, object-based representations (bounding box and center) are typically better at category classification but worse at object localization than part-based representations (corners). Object-based representations are also friendlier to end-to-end learning, because they do not require a post-processing step to compose objects from corners as part-based representation methods do. Among object-based representations, the bounding box representation enables more sophisticated feature extraction and multi-stage processing, while the center representation is attractive for its simplified system design.
2.2 Object Detection Frameworks in a Representation View
Object detection methods can be seen as evolving intermediate object/part representations until the final bounding box outputs. These representation flows largely shape different object detectors. Several major categorizations of object detectors are based on the representation flow: top-down (object-based representation) vs. bottom-up (part-based representation), anchor-based (bounding box based) vs. anchor-free (center point based), and single-stage (one-time representation flow) vs. multi-stage (multiple-time representation flow). Figure 2 shows the representation flows of several typical object detection frameworks, as detailed below.
Faster R-CNN
[24] employs bounding boxes as its intermediate object representation in all stages. At the beginning, multiple anchor boxes at each feature map position are hypothesized to coarsely cover the 4-d bounding box space of an image, i.e., 3 anchor boxes with different aspect ratios. The image feature vector at the center point is extracted to represent each anchor box, and is then used for foreground/background classification and localization refinement. After anchor box selection and localization refinement, the object representation evolves into a set of proposal boxes, whose object features are usually extracted by an RoIAlign operator within each box area. The final bounding box outputs are obtained by localization refinement, through a small network on the proposal features.
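The anchor-hypothesis step can be sketched as follows: at every feature map position, one anchor of (roughly) fixed area is placed per aspect ratio. This is a minimal illustration of the scheme, not Faster R-CNN's actual implementation; the base size and ratios are assumed defaults:

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride, base_size=128,
                 aspect_ratios=(0.5, 1.0, 2.0)):
    """Hypothesize (x_c, y_c, w, h) anchors at every feature position.

    One anchor per aspect ratio, all with equal area base_size**2,
    centered on the corresponding image-space location.
    """
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            xc, yc = (j + 0.5) * stride, (i + 0.5) * stride
            for r in aspect_ratios:
                # w/h = 1/r while preserving area base_size**2.
                w = base_size * np.sqrt(1.0 / r)
                h = base_size * np.sqrt(r)
                anchors.append((xc, yc, w, h))
    return np.array(anchors)  # (feat_h * feat_w * len(aspect_ratios), 4)
```

Each anchor is then scored and refined by the region proposal head, after which the survivors become proposal boxes.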
RetinaNet
[18] is a one-stage object detector, which also employs bounding boxes as its intermediate representation. Due to its one-stage nature, it usually requires denser anchor hypotheses, i.e., 9 anchor boxes at each feature map position. The final bounding box outputs are also obtained by applying a localization refinement head network.
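The denser hypothesis set per position arises from crossing aspect ratios with intermediate octave scales. A sketch of the 9 per-position anchor shapes, assuming the commonly used 3 ratios x 3 scales setting (parameter values are assumptions, not taken from this paper):

```python
import numpy as np

def anchors_per_position(base_size=32,
                         ratios=(0.5, 1.0, 2.0),
                         scales=(2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3))):
    """Return the (w, h) of the 9 anchors at one feature map position:
    3 aspect ratios crossed with 3 intra-octave scales."""
    shapes = []
    for s in scales:
        size = base_size * s
        for r in ratios:
            # Equal area size**2 per scale, aspect ratio w/h = 1/r.
            shapes.append((size * np.sqrt(1.0 / r), size * np.sqrt(r)))
    return np.array(shapes)  # (9, 2)
```

Compared with the 3 anchors per position above, this denser tiling compensates for the absence of a proposal refinement stage.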