Mix-and-Match Tuning for Self-Supervised Semantic Segmentation
Xiaohang Zhan Ziwei Liu Ping Luo Xiaoou Tang Chen Change Loy
Department of Information Engineering, The Chinese University of Hong Kong
{zx017, lz013, pluo, xtang, ccloy}@ie.cuhk.edu.hk
Abstract
Deep convolutional networks for semantic image segmentation typically require large-scale labeled data, e.g., ImageNet and MS COCO, for network pre-training. To reduce annotation efforts, self-supervised semantic segmentation has recently been proposed to pre-train a network without any human-provided labels. The key to this new form of learning is to design a proxy task (e.g., image colorization) from which a discriminative loss can be formulated on unlabeled data. Many proxy tasks, however, lack the critical supervision signals that would induce a discriminative representation for the target image segmentation task, so the performance of self-supervision still falls far short of supervised pre-training. In this study, we overcome this limitation by incorporating a ‘mix-and-match’ (M&M) tuning stage into the self-supervision pipeline. The proposed approach is readily pluggable into many self-supervision methods and does not use more annotated samples than the original process. Yet, it is capable of boosting the performance of the target image segmentation task to surpass that of its fully-supervised pre-trained counterpart. The improvement is made possible by better harnessing the limited pixel-wise annotations in the target dataset. Specifically, we first introduce the ‘mix’ stage, which sparsely samples and mixes patches from the target set to reflect the rich and diverse local patch statistics of target images. A ‘match’ stage then forms a class-wise connected graph, from which a strong triplet-based discriminative loss can be derived for fine-tuning the network. Our paradigm follows the standard practice in existing self-supervised studies, and no extra data or labels are required. With the proposed M&M approach, for the first time, a self-supervision method achieves comparable or even better performance than its ImageNet pre-trained counterpart on both the PASCAL VOC2012 and CityScapes datasets.
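To make the ‘match’ stage concrete, below is a minimal PyTorch sketch of a triplet-based patch loss of the kind described above. It assumes patch embeddings have already been extracted for patches sampled in the ‘mix’ stage and labeled by their dominant pixel class; the margin value, the one-triplet-per-class sampling, and the function name are illustrative assumptions rather than the paper’s exact formulation.

import random
import torch.nn.functional as F

def match_triplet_loss(features, labels, margin=0.2):
    # features: (N, D) embeddings of sampled patches; labels: (N,) patch
    # classes, assumed to come from the 'mix' stage. For each class with at
    # least two patches and at least one patch of another class, draw one
    # (anchor, positive, negative) triplet and apply a margin-based loss.
    lab = labels.tolist()
    by_class = {}
    for i, c in enumerate(lab):
        by_class.setdefault(c, []).append(i)
    loss, count = features.new_zeros(()), 0
    for c, idxs in by_class.items():
        negatives = [i for i, c2 in enumerate(lab) if c2 != c]
        if len(idxs) < 2 or not negatives:
            continue
        a, p = random.sample(idxs, 2)
        n = random.choice(negatives)
        d_ap = F.pairwise_distance(features[a:a + 1], features[p:p + 1])
        d_an = F.pairwise_distance(features[a:a + 1], features[n:n + 1])
        loss = loss + F.relu(d_ap - d_an + margin).squeeze()
        count += 1
    return loss / max(count, 1)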
Introduction
Semantic image segmentation is a classic computer vision task that aims at assigning a class label, such as “chair”, “person”, or “dog”, to each pixel in an image. It enjoys a wide spectrum of applications, such as scene understanding (Li, Socher, and Fei-Fei 2009; Lin et al. 2014; Li et al. 2017b) and autonomous driving (Geiger et al. 2013; Cordts et al. 2016; Li et al. 2017a). Deep convolutional neural network
(CNN) is now the state-of-the-art technique for semantic image segmentation (Long, Shelhamer, and Darrell 2015; Liu et al. 2015; Zhao et al. 2017; Liu et al. 2017). The excellent performance, however, comes at the price of expensive and laborious label annotation. In most existing pipelines, a network is usually first pre-trained on millions of class-labeled images, e.g., ImageNet (Russakovsky et al. 2015) and MS COCO (Lin et al. 2014), and subsequently fine-tuned with thousands of pixel-wise annotated images.
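As a concrete illustration of this standard pre-train-then-fine-tune pipeline, the PyTorch sketch below initializes a segmentation network from an ImageNet-pretrained backbone and fine-tunes it with a pixel-wise cross-entropy loss. The torchvision FCN model, the weights flag, and the helper name are assumptions chosen for brevity; the experiments discussed in this paper use different architectures (e.g., VGG-16-based networks).

import torch
import torch.nn as nn
from torchvision.models.segmentation import fcn_resnet50

# ImageNet-pretrained backbone, randomly initialized segmentation head
# (21 classes for PASCAL VOC: 20 object classes + background).
model = fcn_resnet50(weights_backbone="IMAGENET1K_V1", num_classes=21)
criterion = nn.CrossEntropyLoss(ignore_index=255)  # 255 = unlabeled pixels in VOC
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def finetune_step(images, masks):
    # images: (N, 3, H, W) floats; masks: (N, H, W) integer class indices.
    logits = model(images)["out"]
    loss = criterion(logits, masks)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()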
Self-supervised learning¹ is a new paradigm proposed for learning deep representations without extensive annotations. This new technique has been applied to the task of image segmentation (Zhang, Isola, and Efros 2016a; Larsson, Maire, and Shakhnarovich 2016; 2017). In general,
self-supervised image segmentation can be divided into two stages: a proxy stage and a fine-tuning stage. The proxy stage needs no labeled data, but requires one to design a proxy (or pretext) task with self-derived supervisory signals on unlabeled data. For instance, learning by colorization (Larsson, Maire, and Shakhnarovich 2017) exploits the fact that a natural image is composed of a luminance channel and chrominance channels. The proxy task is formulated with a cross-entropy loss that predicts an image’s chrominance from its luminance. In the fine-tuning stage, the learned representations are used to initialize the target semantic segmentation network, which is then fine-tuned with pixel-wise annotations. It has been shown that, even without large-scale class-labeled pre-training, semantic image segmentation can still achieve encouraging performance compared to random initialization or from-scratch training.
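For concreteness, here is a minimal PyTorch sketch of such a colorization proxy loss: the ab chrominance channels are quantized into discrete bins, and a per-pixel classifier is trained with cross-entropy to predict the bin from the L (luminance) channel alone. The bin count, the backbone interface, and all names are illustrative assumptions; the actual formulation of Larsson, Maire, and Shakhnarovich (2017) (e.g., its hue/chroma binning) differs in detail.

import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_BINS = 16  # assumption: 16 bins per ab channel -> 256 color classes

def quantize_ab(ab):
    # Map continuous ab chrominance in [-1, 1] to one bin index per pixel.
    idx = ((ab + 1.0) / 2.0 * (NUM_BINS - 1)).round().long()  # (N, 2, H, W)
    return idx[:, 0] * NUM_BINS + idx[:, 1]                   # (N, H, W)

class ColorizationModel(nn.Module):
    # Per-pixel classifier over chrominance bins, on top of any backbone
    # that maps (N, 1, H, W) luminance to (N, feat_dim, h, w) features.
    def __init__(self, backbone, feat_dim):
        super().__init__()
        self.backbone = backbone
        self.classifier = nn.Conv2d(feat_dim, NUM_BINS * NUM_BINS, kernel_size=1)

    def forward(self, luminance):
        return self.classifier(self.backbone(luminance))

def proxy_loss(model, luminance, ab):
    # Cross-entropy between predicted bin logits and quantized ab targets,
    # with targets resized to the (possibly strided) output resolution.
    logits = model(luminance)                                  # (N, 256, h, w)
    ab_small = F.interpolate(ab, size=logits.shape[-2:],
                             mode="bilinear", align_corners=False)
    return F.cross_entropy(logits, quantize_ab(ab_small))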
Though promising, the performance of self-supervised learning is still far from that achieved by supervised pre-training. For instance, a VGG-16 network trained with the self-supervised method of (Larsson, Maire, and Shakhnarovich 2017) achieves 56.0% mean Intersection over Union (mIoU) on the PASCAL VOC 2012 segmentation benchmark (Everingham et al. 2010), higher than a randomly initialized network, which yields only 35.0% mIoU. However, an identical network pre-trained on ImageNet achieves 64.2% mIoU. A considerable gap thus remains between self-supervised and purely supervised pre-training.
We believe that the performance discrepancy is mainly
¹ Project page: http://mmlab.ie.cuhk.edu.hk/projects/M&M/