Feature Pyramid Networks for Object Detection
Tsung-Yi Lin¹,², Piotr Dollár¹, Ross Girshick¹, Kaiming He¹,
Bharath Hariharan¹, and Serge Belongie²
¹ Facebook AI Research (FAIR)
² Cornell University and Cornell Tech
Abstract
Feature pyramids are a basic component in recognition
systems for detecting objects at different scales. But recent
deep learning object detectors have avoided pyramid rep-
resentations, in part because they are compute and memory
intensive. In this paper, we exploit the inherent multi-scale,
pyramidal hierarchy of deep convolutional networks to con-
struct feature pyramids with marginal extra cost. A top-
down architecture with lateral connections is developed for
building high-level semantic feature maps at all scales. This
architecture, called a Feature Pyramid Network (FPN),
shows significant improvement as a generic feature extrac-
tor in several applications. Using FPN in a basic Faster
R-CNN system, our method achieves state-of-the-art single-
model results on the COCO detection benchmark without
bells and whistles, surpassing all existing single-model en-
tries including those from the COCO 2016 challenge win-
ners. In addition, our method can run at 6 FPS on a GPU
and thus is a practical and accurate solution to multi-scale
object detection. Code will be made publicly available.
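As a rough illustration of the top-down pathway with lateral connections described above, the following minimal sketch merges a coarse-to-fine stack of feature maps. It makes several simplifying assumptions that are not stated in the abstract: maps are single-channel nested lists, each level has half the resolution of the previous one, upsampling is nearest-neighbour by a factor of 2, and the lateral merge is a plain element-wise addition with no learned convolutions. The names `upsample2x`, `add_maps`, and `top_down_pyramid` are hypothetical.

```python
from typing import List

# Single-channel 2D feature map; kept tiny and dependency-free for illustration.
FeatureMap = List[List[float]]


def upsample2x(x: FeatureMap) -> FeatureMap:
    """Nearest-neighbour upsampling by a factor of 2 in both dimensions."""
    out: FeatureMap = []
    for row in x:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out


def add_maps(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Element-wise sum of two maps of identical shape."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("feature maps must have the same shape")
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down_pyramid(bottom_up: List[FeatureMap]) -> List[FeatureMap]:
    """Merge bottom-up maps (ordered fine to coarse) into a top-down pyramid.

    Assumption: a real network would apply learned layers on the lateral
    connections; here the lateral path is the identity.
    """
    merged = [bottom_up[-1]]  # the coarsest map starts the top-down pathway
    for lateral in reversed(bottom_up[:-1]):
        merged.append(add_maps(upsample2x(merged[-1]), lateral))
    merged.reverse()          # return in the same fine-to-coarse order
    return merged
```

In this sketch every output level receives a contribution from the coarsest (semantically strongest) map, which is the intuition behind building high-level semantic feature maps at all scales.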
1. Introduction
Recognizing objects at vastly different scales is a fun-
damental challenge in computer vision. Feature pyramids
built upon image pyramids (for short we call these featur-
ized image pyramids) form the basis of a standard solution
[1] (Fig. 1(a)). These pyramids are scale-invariant in the
sense that an object’s scale change is offset by shifting its
level in the pyramid. Intuitively, this property enables a
model to detect objects across a large range of scales by
scanning the model over both positions and pyramid levels.
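The level-shift property can be made concrete with a small sketch. It assumes a pyramid whose levels are geometrically spaced with a fixed number of scales per octave, with level 0 at full resolution and higher levels coarser; the function names and parameters below are illustrative and not taken from the paper. Under that assumption, rescaling an object by a factor s is offset by moving scales_per_octave · log2(s) levels.

```python
import math


def level_shift(scale_change: float, scales_per_octave: int = 1) -> float:
    """Number of pyramid levels that offsets a given change in object scale.

    Assumes geometrically spaced levels, so adjacent levels differ in
    resolution by a factor of 2 ** (1 / scales_per_octave).
    """
    if scale_change <= 0:
        raise ValueError("scale_change must be positive")
    return scales_per_octave * math.log2(scale_change)


def apparent_size(object_size: float, level: float,
                  scales_per_octave: int = 1) -> float:
    """Object size in pixels as seen at a pyramid level (level 0 = full resolution)."""
    return object_size / 2 ** (level / scales_per_octave)
```

For example, an object that doubles in size appears unchanged one level coarser in a one-scale-per-octave pyramid, and ten levels coarser in a pyramid with 10 scales per octave, so a fixed-size model scanned over levels sees both objects at the same apparent size.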
Featurized image pyramids were heavily used in the
era of hand-engineered features [5, 25]. They were so
critical that object detectors like DPM [7] required dense
scale sampling to achieve good results (e.g., 10 scales per
octave). For recognition tasks, engineered features have

Figure 1. (a) Featurized image pyramid: an image pyramid is used
to build a feature pyramid. Features are computed on each of the
image scales independently, which is slow. (b) Single feature map:
recent detection systems have opted to use only single-scale
features for faster detection. (c) Pyramidal feature hierarchy: an
alternative is to reuse the pyramidal feature hierarchy computed by
a ConvNet as if it were a featurized image pyramid. (d) Feature
Pyramid Network: our proposed Feature Pyramid Network (FPN) is fast
like (b) and (c), but more accurate. In this figure, feature maps
are indicated by blue outlines, and thicker outlines denote
semantically stronger features.

largely been replaced with features computed by deep con-
volutional networks (ConvNets) [19, 20]. Aside from being
capable of representing higher-level semantics, ConvNets
are also more robust to variance in scale and thus facilitate
recognition from features computed on a single input scale
[15, 11, 29] (Fig. 1(b)). But even with this robustness, pyra-
mids are still needed to get the most accurate results. All re-
cent top entries in the ImageNet [33] and COCO [21] detec-
tion challenges use multi-scale testing on featurized image
pyramids (e.g., [16, 35]). The principal advantage of fea-
turizing each level of an image pyramid is that it produces
a multi-scale feature representation in which all levels are
semantically strong, including the high-resolution levels.
Nevertheless, featurizing each level of an image pyra-
mid has obvious limitations. Inference time increases con-
siderably (e.g., by four times [11]), making this approach
impractical for real applications. Moreover, training deep
arXiv:1612.03144v2 [cs.CV] 19 Apr 2017