
[Figure 2 diagram: Backbone (C3, C4, C5) -> Feature Pyramid (P3 to P7) -> Head
at each level (Classification + Center-ness + Regression), with heads shared
between feature levels. Each head has two branches of H × W × 256 feature maps
(×4): one ends in Classification (H × W × C) and Center-ness (H × W × 1), the
other in Regression (H × W × 4). Level sizes H × W /s for an 800 × 1024 input:
100 × 128 /8, 50 × 64 /16, 25 × 32 /32, 13 × 16 /64, 7 × 8 /128.]
Figure 2: The network architecture of FCOS, where C3, C4, and C5 denote the
feature maps of the backbone network and P3 to P7 are the feature levels used
for the final prediction. H × W is the height and width of the feature maps.
‘/s’ (s = 8, 16, ..., 128) is the downsampling ratio of the feature maps at
each level relative to the input image. As an example, all the numbers are
computed with an 800 × 1024 input.
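The per-level sizes quoted in Figure 2 can be reproduced with simple arithmetic.
The sketch below assumes that each level's size is the input size divided by the
stride and rounded up; this rounding rule is our assumption, chosen because it
matches the 13 × 16 and 7 × 8 entries, and the function name `pyramid_sizes` is
illustrative only.

```python
import math


def pyramid_sizes(height, width, strides=(8, 16, 32, 64, 128)):
    """Return {level name: (H, W)} for P3..P7.

    Assumption: each size is ceil(input size / stride).
    """
    sizes = {}
    for level, stride in zip(range(3, 8), strides):
        sizes[f"P{level}"] = (
            math.ceil(height / stride),
            math.ceil(width / stride),
        )
    return sizes


# Print the sizes for the 800 x 1024 example input of Figure 2.
for name, (h, w) in pyramid_sizes(800, 1024).items():
    print(f"{name}: {h}x{w}")
```

Because the heads are shared between levels, only H and W change from one
level to the next; the channel dimensions of the head outputs stay the same.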
to be carefully tuned in order to achieve good performance. Besides the above
hyper-parameters describing anchor shapes, anchor-based detectors also need
other hyper-parameters to label each anchor box as a positive, ignored, or
negative sample. Previous works often use the intersection over union (IOU)
between anchor boxes and ground-truth boxes to determine the label of an
anchor box (e.g., a positive anchor if its IOU is in [0.5, 1]). These
hyper-parameters have a large impact on the final accuracy and require
heuristic tuning. Moreover, they are specific to detection tasks, making
detection deviate from the neat fully convolutional network architectures
used in other dense prediction tasks such as semantic segmentation.
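To make the labeling rule concrete, the following is a minimal sketch of
IOU-based anchor labeling. Only the positive range [0.5, 1] comes from the text
above; the negative threshold of 0.4, the (x0, y0, x1, y1) box format, and the
names `iou` and `label_anchor` are assumptions made for illustration, and
individual detectors choose these values differently.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, pos_thresh=0.5, neg_thresh=0.4):
    """Label an anchor by its best IOU with any ground-truth box.

    pos_thresh follows the [0.5, 1] example in the text;
    neg_thresh is a hypothetical value, not taken from the paper.
    """
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thresh:
        return "positive"
    if best < neg_thresh:
        return "negative"
    return "ignored"


gt_boxes = [(0, 0, 10, 10)]
print(label_anchor((0, 0, 10, 5), gt_boxes))
print(label_anchor((20, 20, 30, 30), gt_boxes))
```

Both thresholds are extra hyper-parameters that exist only because anchors
exist, which is the tuning burden the paragraph above describes.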
Anchor-free Detectors. The most popular anchor-free detector might be
YOLOv1 [21]. Instead of using anchor boxes, YOLOv1 predicts bounding boxes at
points near the center of objects. Only the points near the center are used
because they are considered to produce higher-quality detections. However,
because only these points are used to predict bounding boxes, YOLOv1 suffers
from low recall, as mentioned in YOLOv2 [22]. As a result, YOLOv2 [22] employs
anchor boxes as well. Compared to YOLOv1, FCOS takes advantage of all points
in a ground-truth bounding box to predict the bounding boxes, and the
low-quality detected bounding boxes are suppressed by the proposed
“center-ness” branch. As a result, FCOS is able to provide recall comparable
to that of anchor-based detectors, as shown in our experiments.
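As a rough illustration of how many more candidate points become available when
every location inside a ground-truth box is used, the sketch below enumerates
the feature-map locations whose image-plane position falls inside a box. The
mapping from a feature-map index to the image position (stride // 2 + index *
stride) is an assumption made for this sketch, as are the function name
`locations_in_box` and the example box.

```python
def locations_in_box(box, stride, height, width):
    """Image-plane positions of feature-map locations that lie inside `box`.

    Assumption: feature-map index (fx, fy) maps to the image position
    (stride // 2 + fx * stride, stride // 2 + fy * stride).
    """
    x0, y0, x1, y1 = box
    rows = -(-height // stride)  # ceil(height / stride)
    cols = -(-width // stride)   # ceil(width / stride)
    points = []
    for fy in range(rows):
        for fx in range(cols):
            cx = stride // 2 + fx * stride
            cy = stride // 2 + fy * stride
            if x0 <= cx <= x1 and y0 <= cy <= y1:
                points.append((cx, cy))
    return points


box = (100, 200, 356, 456)  # hypothetical ground-truth box (x0, y0, x1, y1)
points = locations_in_box(box, stride=8, height=800, width=1024)
print(len(points), "candidate locations at stride 8")
```

A scheme that keeps only the location nearest the box center would retain a
single one of these points, which gives some intuition for the recall gap
discussed above.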
CornerNet [13] is a recently proposed one-stage anchor-free detector, which
detects a pair of corners of a bounding box and groups them to form the final
detected bounding box. CornerNet requires much more complicated
post-processing to group the pairs of corners belonging to the same instance.
An extra distance metric is learned for the purpose of grouping.
Another family of anchor-free detectors, such as [32], is based on
DenseBox [12]. This family of detectors has been considered unsuitable for
generic object detection due to the difficulty of handling overlapping
bounding boxes and relatively low recall. In this work, we show that both
problems can be largely alleviated with multi-level FPN prediction. Moreover,
we show that, together with our proposed center-ness branch, this much simpler
detector can achieve even better detection performance than its anchor-based
counterparts.
3. Our Approach
In this section, we first reformulate object detection in a per-pixel
prediction fashion. Next, we show how we make use of multi-level prediction to
improve the recall and resolve the ambiguity resulting from overlapping
bounding boxes. Finally, we present our proposed “center-ness” branch, which
helps suppress the low-quality detected bounding boxes and improves the
overall performance by a large margin.
3.1. Fully Convolutional One-Stage Object Detector
Let $F_i \in \mathbb{R}^{H \times W \times C}$ be the feature maps at layer $i$
of a backbone CNN and $s$ be the total stride until the layer.
The ground-truth bounding boxes for an input image are