Mask R-CNN
Kaiming He Georgia Gkioxari Piotr Dollár Ross Girshick
Facebook AI Research (FAIR)
Abstract
We present a conceptually simple, flexible, and general
framework for object instance segmentation. Our approach
efficiently detects objects in an image while simultaneously
generating a high-quality segmentation mask for each in-
stance. The method, called Mask R-CNN, extends Faster
R-CNN by adding a branch for predicting an object mask in
parallel with the existing branch for bounding box recogni-
tion. Mask R-CNN is simple to train and adds only a small
overhead to Faster R-CNN, running at 5 fps. Moreover,
Mask R-CNN is easy to generalize to other tasks, e.g., al-
lowing us to estimate human poses in the same framework.
We show top results in all three tracks of the COCO suite of
challenges, including instance segmentation, bounding-box
object detection, and person keypoint detection. Without
tricks, Mask R-CNN outperforms all existing, single-model
entries on every task, including the COCO 2016 challenge
winners. We hope our simple and effective approach will
serve as a solid baseline and help ease future research in
instance-level recognition. Code will be made available.
1. Introduction
The vision community has rapidly improved object de-
tection and semantic segmentation results over a short pe-
riod of time. In large part, these advances have been driven
by powerful baseline systems, such as the Fast/Faster R-
CNN [9, 29] and Fully Convolutional Network (FCN) [24]
frameworks for object detection and semantic segmenta-
tion, respectively. These methods are conceptually intuitive
and offer flexibility and robustness, together with fast train-
ing and inference time. Our goal in this work is to develop a
comparably enabling framework for instance segmentation.
Instance segmentation is challenging because it requires
the correct detection of all objects in an image while also
precisely segmenting each instance. It therefore combines
elements from the classical computer vision tasks of object
detection, where the goal is to classify individual objects
and localize each using a bounding box, and semantic
segmentation, where the goal is to classify each pixel into
a fixed set of categories without differentiating object
instances.¹ Given this, one might expect a complex method
is required to achieve good results. However, we show that
a surprisingly simple, flexible, and fast system can surpass
prior state-of-the-art instance segmentation results.
Figure 1. The Mask R-CNN framework for instance segmentation. (Diagram labels: RoIAlign, conv, conv, class, box.)
Our method, called Mask R-CNN, extends Faster R-CNN
[29] by adding a branch for predicting segmentation masks
on each Region of Interest (RoI), in parallel with the ex-
isting branch for classification and bounding box regres-
sion (Figure 1). The mask branch is a small FCN applied
to each RoI, predicting a segmentation mask in a pixel-to-
pixel manner. Mask R-CNN is simple to implement and
train given the Faster R-CNN framework, which facilitates
a wide range of flexible architecture designs. Additionally,
the mask branch only adds a small computational overhead,
enabling a fast system and rapid experimentation.
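The parallel-branch structure described above can be illustrated with a minimal sketch in plain Python. It is not the implementation behind the reported results: the names (`RoIPrediction`, `predict_roi`), the toy weights, and the feature sizes are all hypothetical, and the mask branch is reduced to a single 1x1 convolution as a stand-in for a small FCN. The point is only that the class and box outputs are computed from a pooled vector, while the mask output keeps the spatial layout of the RoI feature, so each output cell corresponds to the input cell at the same position.

```python
from dataclasses import dataclass
from typing import List

# One RoI feature, indexed as [height][width][channels].
Grid = List[List[List[float]]]


@dataclass
class RoIPrediction:
    class_scores: List[float]  # one score per category
    box_deltas: List[float]    # four box-refinement values
    mask_logits: Grid          # [height][width][num_classes]


def dot(weights: List[float], values: List[float]) -> float:
    return sum(w * v for w, v in zip(weights, values))


def linear(weight_rows: List[List[float]], values: List[float]) -> List[float]:
    """Apply one weight row per output unit (no bias, for brevity)."""
    return [dot(row, values) for row in weight_rows]


def global_average(features: Grid) -> List[float]:
    """Collapse the spatial grid into a single feature vector."""
    cells = [cell for row in features for cell in row]
    channels = len(cells[0])
    return [sum(c[k] for c in cells) / len(cells) for k in range(channels)]


def predict_roi(features: Grid,
                class_w: List[List[float]],
                box_w: List[List[float]],
                mask_w: List[List[float]]) -> RoIPrediction:
    # Classification / box-regression branch: works on a pooled vector,
    # so the spatial layout of the RoI is discarded.
    pooled = global_average(features)
    class_scores = linear(class_w, pooled)
    box_deltas = linear(box_w, pooled)
    # Mask branch (simplified to a 1x1 convolution): the same linear map
    # is applied at every cell, so output (y, x) depends on input (y, x).
    mask_logits = [[linear(mask_w, cell) for cell in row] for row in features]
    return RoIPrediction(class_scores, box_deltas, mask_logits)


# Toy example: a 2x2 RoI with 2 channels and 2 categories (all values made up).
features = [[[1.0, 0.0], [0.0, 1.0]],
            [[1.0, 1.0], [0.0, 0.0]]]
class_w = [[1.0, 0.0], [0.0, 1.0]]
box_w = [[0.0, 0.0] for _ in range(4)]
mask_w = [[1.0, -1.0], [-1.0, 1.0]]

pred = predict_roi(features, class_w, box_w, mask_w)
```

In this toy setting the mask output has the same 2x2 layout as the RoI feature, whereas the class and box outputs are flat vectors. A realistic mask branch would stack several convolutions, but the pixel-to-pixel property illustrated here would be the same.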
In principle Mask R-CNN is an intuitive extension of
Faster R-CNN, yet constructing the mask branch properly
is critical for good results. Most importantly, Faster R-CNN
was not designed for pixel-to-pixel alignment between net-
work inputs and outputs. This is most evident in how
RoIPool [14, 9], the de facto core operation for attending
to instances, performs coarse spatial quantization for fea-
ture extraction. To fix the misalignment, we propose a sim-
ple, quantization-free layer, called RoIAlign, that faithfully
preserves exact spatial locations. Despite being a seem-
¹ Following common terminology, we use object detection to denote
detection via bounding boxes, not masks, and semantic segmentation to
denote per-pixel classification without differentiating instances. Yet we
note that instance segmentation is both semantic and a form of detection.
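To make the misalignment caused by coarse spatial quantization concrete, the following sketch compares a quantized feature lookup with a quantization-free one. Several things here are assumptions for illustration and are not stated in this section: the feature-map stride of 16, the helper names, and the use of bilinear interpolation as the way to read a feature value at a continuous location.

```python
import math
from typing import List

STRIDE = 16  # assumed feature-map stride, chosen only for illustration


def continuous_coord(x_image: float, stride: int = STRIDE) -> float:
    """Map an image coordinate to feature-map units without rounding."""
    return x_image / stride


def quantized_coord(x_image: float, stride: int = STRIDE) -> int:
    """Quantized mapping: snap to the nearest integer feature cell."""
    return math.floor(x_image / stride + 0.5)


def misalignment_pixels(x_image: float, stride: int = STRIDE) -> float:
    """Offset, in image pixels, introduced by the quantized mapping."""
    offset = continuous_coord(x_image, stride) - quantized_coord(x_image, stride)
    return offset * stride


def bilinear_sample(feature_map: List[List[float]], y: float, x: float) -> float:
    """Read a value at a continuous (y, x) location inside the map."""
    height, width = len(feature_map), len(feature_map[0])
    y0, x0 = math.floor(y), math.floor(x)
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    wy, wx = y - y0, x - x0
    top = feature_map[y0][x0] * (1 - wx) + feature_map[y0][x1] * wx
    bottom = feature_map[y1][x0] * (1 - wx) + feature_map[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


# Toy feature map whose value equals its column index (a horizontal ramp).
ramp = [[float(col) for col in range(8)] for _ in range(2)]

x_image = 100.0                          # RoI edge in image pixels (made up)
x_exact = continuous_coord(x_image)      # 6.25 feature cells
x_snapped = quantized_coord(x_image)     # 6
shift = misalignment_pixels(x_image)     # 4.0 image pixels

value_quantized = ramp[0][x_snapped]                  # reads cell 6
value_aligned = bilinear_sample(ramp, 0.0, x_exact)   # reads position 6.25
```

With these numbers, rounding moves the sampled location by 4 image pixels. That shift is small for box classification but significant for a per-pixel mask, while the interpolated lookup reads the feature at the exact location.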
2017 IEEE International Conference on Computer Vision
2380-7504/17 $31.00 © 2017 IEEE
DOI 10.1109/ICCV.2017.322