4 Wang et al.
TensorMask [4] adopts the dense sliding window paradigm to segment the in-
stance in the local window for each pixel with a predefined number of windows
and scales. In contrast to the top-down methods above, SOLO is entirely
box-free and thus not restricted by (anchor) box locations and scales, and it
naturally benefits from the inherent advantages of FCNs.
Bottom-up Instance Segmentation. This category of approaches generates
instance masks by grouping pixels into an arbitrary number of object
instances present in an image. In [22], pixels are grouped into instances
using a learned associative embedding. A discriminative loss function [7] learns
pixel-level instance embeddings efficiently by pushing apart pixels belonging to
different instances and pulling together pixels in the same instance. SGN [18]
decomposes the instance segmentation problem into a sequence of sub-grouping
problems. SSAP [8] learns a pixel-pair affinity pyramid, the probability that two
pixels belong to the same instance, and sequentially generates instances by a
cascaded graph partition. Bottom-up methods typically lag behind top-down
methods in accuracy, especially on datasets with diverse scenes.
Instead of exploiting pairwise pixel relations, SOLO learns directly and solely
from instance mask annotations during training, and predicts instance masks
end-to-end without grouping post-processing.
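To make the push/pull idea behind the discriminative loss [7] concrete, the following is a minimal sketch in plain Python. It assumes a simplified hinged form with a pull term (pixels toward their instance center) and a push term (instance centers away from each other). The function name `push_pull_loss`, the margins `pull_margin` and `push_margin`, and their default values are illustrative assumptions, not taken from [7], and any further terms or weighting used there are omitted.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def push_pull_loss(embeddings, instance_ids, pull_margin=0.5, push_margin=1.5):
    """Return (pull, push) terms for per-pixel embeddings (illustrative only)."""
    # Group pixel embeddings by the instance they belong to.
    groups = {}
    for emb, inst in zip(embeddings, instance_ids):
        groups.setdefault(inst, []).append(emb)
    centers = {inst: mean_vector(vs) for inst, vs in groups.items()}

    # Pull: penalize pixels farther than pull_margin from their own center.
    pull = 0.0
    for inst, vs in groups.items():
        pull += sum(
            max(0.0, distance(v, centers[inst]) - pull_margin) ** 2 for v in vs
        ) / len(vs)
    pull /= len(groups)

    # Push: penalize pairs of centers closer than 2 * push_margin.
    ids = sorted(centers)
    pairs = [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]]
    push = 0.0
    if pairs:
        push = sum(
            max(0.0, 2 * push_margin - distance(centers[a], centers[b])) ** 2
            for a, b in pairs
        ) / len(pairs)
    return pull, push
```

Instances are recovered from such embeddings only after a separate clustering step, which is the grouping post-processing that SOLO avoids.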
Direct Instance Segmentation. To our knowledge, no prior method trains directly
and solely with mask annotations and predicts instance masks and semantic
categories in one shot without the need for grouping post-processing. Several
recently proposed methods may be viewed as following a ‘semi-direct’ paradigm.
AdaptIS [26]
first predicts point proposals, and then sequentially generates the mask for the
object located at the detected point proposal. PolarMask [28] proposes to use
the polar representation to encode masks and transforms per-pixel mask
prediction into distance regression. Neither needs bounding boxes for training,
but each is either step-wise or founded on a compromise, e.g., a coarse
parametric representation of masks. SOLO takes an image as input and directly
outputs instance masks and corresponding class probabilities in a fully
convolutional, box-free and grouping-free paradigm.
2 Our Method: SOLO
2.1 Problem Formulation
The central idea of the SOLO framework is to reformulate instance segmentation
as two simultaneous category-aware prediction problems. Concretely, our system
divides the input image into a uniform grid of S×S cells. If the center of an
object falls into a grid cell, that grid cell is responsible for 1) predicting
the semantic category and 2) segmenting that object instance.
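The assignment of objects to grid cells can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `(cx, cy, class_id)` object format, the `background` label of -1, and the clamping of border centers are all assumptions made for the example. How the object center itself is computed is left to the caller.

```python
def center_to_grid_cell(cx, cy, img_w, img_h, S):
    """Return (row, col) of the S x S grid cell containing the point (cx, cy).

    Assumes non-negative pixel coordinates with the origin at the top-left.
    """
    # Scale pixel coordinates to grid units and truncate to a cell index.
    # Clamp so a center exactly on the right/bottom border stays in range.
    col = min(int(cx * S / img_w), S - 1)
    row = min(int(cy * S / img_h), S - 1)
    return row, col

def build_category_grid(objects, img_w, img_h, S, background=-1):
    """Mark, for each object, the grid cell responsible for it.

    `objects` is a list of (cx, cy, class_id) tuples (hypothetical format).
    If two centers fall into the same cell, the later object overwrites the
    earlier one; this sketch does not resolve such collisions.
    """
    grid = [[background] * S for _ in range(S)]
    for cx, cy, class_id in objects:
        row, col = center_to_grid_cell(cx, cy, img_w, img_h, S)
        grid[row][col] = class_id
    return grid

# Usage: a 100 x 100 image, a 5 x 5 grid, one object of class 3 centered at (50, 20).
grid = build_category_grid([(50, 20, 3)], img_w=100, img_h=100, S=5)
```

In this usage example the object center lands in row 1, column 2, so only that cell carries the class label and every other cell stays at the background value.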
Semantic Category. For each grid cell, SOLO predicts a C-dimensional
output to indicate the semantic class probabilities, where C is the number of classes.
These probabilities are conditioned on the grid cell. If we divide the input image