Panoptic Feature Pyramid Networks
Alexander Kirillov Ross Girshick Kaiming He Piotr Dollár
Facebook AI Research (FAIR)
Abstract
The recently introduced panoptic segmentation task has
renewed our community’s interest in unifying the tasks of
instance segmentation (for thing classes) and semantic seg-
mentation (for stuff classes). However, current state-of-
the-art methods for this joint task use separate and dis-
similar networks for instance and semantic segmentation,
without performing any shared computation. In this work,
we aim to unify these methods at the architectural level,
designing a single network for both tasks. Our approach
is to endow Mask R-CNN, a popular instance segmenta-
tion method, with a semantic segmentation branch using
a shared Feature Pyramid Network (FPN) backbone. Sur-
prisingly, this simple baseline not only remains effective for
instance segmentation, but also yields a lightweight, top-
performing method for semantic segmentation. In this work,
we perform a detailed study of this minimally extended ver-
sion of Mask R-CNN with FPN, which we refer to as Panop-
tic FPN, and show it is a robust and accurate baseline for
both tasks. Given its effectiveness and conceptual simplic-
ity, we hope our method can serve as a strong baseline and
aid future research in panoptic segmentation.
1. Introduction
Our community has witnessed rapid progress in seman-
tic segmentation, where the task is to assign each pixel a
class label (e.g. for stuff classes), and more recently in in-
stance segmentation, where the task is to detect and segment
each object instance (e.g. for thing classes). These advances
have been aided by simple yet powerful baseline methods,
including Fully Convolutional Networks (FCN) [39] and
Mask R-CNN [23] for semantic and instance segmentation,
respectively. These methods are conceptually simple, fast,
and flexible, serving as a foundation for much of the sub-
sequent progress in these areas. In this work our goal is
to propose a similarly simple, single-network baseline for
the joint task of panoptic segmentation [29], a task which
encompasses both semantic and instance segmentation.
Figure 1 panels: (a) Feature Pyramid Network, (b) Instance
Segmentation Branch, (c) Semantic Segmentation Branch.
Figure 1: Panoptic FPN. (a) We start with an FPN backbone [34],
widely used in object detection, for extracting rich multi-scale
features. (b) As in Mask R-CNN [23], we use a region-based branch
on top of FPN for instance segmentation. (c) In parallel, we add a
lightweight dense-prediction branch on top of the same FPN features
for semantic segmentation. This simple extension of Mask R-CNN
with FPN is a fast and accurate baseline for both tasks.
While conceptually straightforward, designing a single
network that achieves high accuracy for both tasks is
challenging as top-performing methods for the two tasks
have many differences. For semantic segmentation, FCNs
with specialized backbones enhanced by dilated convolu-
tions [55, 10] dominate popular leaderboards [17, 14]. For
instance segmentation, the region-based Mask R-CNN [23]
with a Feature Pyramid Network (FPN) [34] backbone
has been used as a foundation for all top entries in re-
cent recognition challenges [35, 58, 41]. While there have
been attempts to unify semantic and instance segmentation
[44, 1, 9], the specialization currently necessary to achieve
top performance in each was perhaps inevitable given their
parallel development and separate benchmarks.
Given the architectural differences in these top methods,
one might expect that designing a single network for both
tasks requires compromising accuracy on either instance or
semantic segmentation. Instead, we show a simple,
flexible, and effective architecture that can match accuracy
for both tasks using a single network that simultaneously
generates region-based outputs (for instance segmentation)
and dense-pixel outputs (for semantic segmentation).
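As a rough illustration of this design, and not the implementation used in this work, the toy NumPy sketch below computes one set of multi-scale features and feeds it to two heads: a region-based head that returns a class and a mask per box, and a dense head that returns a class per pixel. Every name in it (toy_fpn, instance_head, semantic_head), the random weights, the hand-written boxes, and the way the dense head merges pyramid levels are assumptions made for illustration only; the text above does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)


def toy_fpn(image, num_levels=4, channels=8):
    """Hypothetical stand-in for a shared FPN backbone.

    Returns one feature map per scale (strides 4, 8, 16, 32). A real FPN is
    a learned network; here each level is an average-pooled copy of the
    image followed by a random linear projection.
    """
    h, w, c = image.shape
    proj = rng.standard_normal((c, channels))
    feats = []
    for level in range(num_levels):
        s = 4 * 2 ** level
        pooled = image.reshape(h // s, s, w // s, s, c).mean(axis=(1, 3))
        feats.append(pooled @ proj)  # shape (h/s, w/s, channels)
    return feats


def semantic_head(feats, num_classes, out_hw):
    """Dense-prediction head (assumed design): one class label per pixel."""
    base_h, base_w, channels = feats[0].shape
    merged = np.zeros((base_h, base_w, channels))
    for f in feats:
        # Nearest-neighbor upsample every level to the finest level, then sum.
        k = base_h // f.shape[0]
        merged += f.repeat(k, axis=0).repeat(k, axis=1)
    logits = merged @ rng.standard_normal((channels, num_classes))
    up = out_hw[0] // base_h  # back to image resolution
    logits = logits.repeat(up, axis=0).repeat(up, axis=1)
    return logits.argmax(axis=-1)


def instance_head(feats, boxes, num_thing_classes, mask_size=14):
    """Region-based head (assumed design): one class and mask per box."""
    fmap = feats[0]  # stride-4 features only, for simplicity
    channels = fmap.shape[-1]
    w_cls = rng.standard_normal((channels, num_thing_classes))
    w_mask = rng.standard_normal(channels)
    outputs = []
    for x0, y0, x1, y1 in boxes:
        # Crop the box in feature coordinates (image coordinates / stride 4).
        fy0, fx0 = y0 // 4, x0 // 4
        region = fmap[fy0:max(y1 // 4, fy0 + 1), fx0:max(x1 // 4, fx0 + 1)]
        # Resample the crop to a fixed grid with nearest-neighbor indexing.
        ys = np.linspace(0, region.shape[0] - 1, mask_size).round().astype(int)
        xs = np.linspace(0, region.shape[1] - 1, mask_size).round().astype(int)
        roi = region[ys][:, xs]  # (mask_size, mask_size, channels)
        outputs.append({
            "box": (x0, y0, x1, y1),
            "class": int((roi.mean(axis=(0, 1)) @ w_cls).argmax()),
            "mask": (roi @ w_mask) > 0,
        })
    return outputs


image = rng.random((64, 96, 3))   # height and width divisible by 32
feats = toy_fpn(image)            # computed once, shared by both heads
sem = semantic_head(feats, num_classes=5, out_hw=image.shape[:2])
inst = instance_head(feats, boxes=[(8, 8, 40, 48), (50, 10, 90, 60)],
                     num_thing_classes=3)
print(sem.shape, len(inst), inst[0]["mask"].shape)
```

The point of the sketch is only the data flow: the features are computed once and both heads read from them, so the two output types share the backbone computation.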
arXiv:1901.02446v1 [cs.CV] 8 Jan 2019