Fast Interactive Object Annotation with Curve-GCN
Huan Ling1,2∗   Jun Gao1,2∗   Amlan Kar1,2   Wenzheng Chen1,2   Sanja Fidler1,2,3
1University of Toronto   2Vector Institute   3NVIDIA
{linghuan, jungao, amlan, wenzheng, fidler}@cs.toronto.edu
Abstract
Manually labeling objects by tracing their boundaries is a laborious process. In [7, 2], the authors proposed Polygon-RNN, which produces polygonal annotations in a recurrent manner using a CNN-RNN architecture, allowing interactive correction via humans-in-the-loop. We propose a new framework that alleviates the sequential nature of Polygon-RNN by predicting all vertices simultaneously using a Graph Convolutional Network (GCN). Our model is trained end-to-end. It supports object annotation by either polygons or splines, facilitating labeling efficiency for both line-based and curved objects. We show that Curve-GCN outperforms all existing approaches in automatic mode, including the powerful PSP-DeepLab [8, 23], and is significantly more efficient in interactive mode than Polygon-RNN++. Our model runs at 29.3ms in automatic mode and 2.6ms in interactive mode, making it 10x and 100x faster than Polygon-RNN++, respectively.
1. Introduction
Object instance segmentation is the problem of outlining all objects of a given class in an image, a task that has been receiving increased attention in the past few years [15, 36, 20, 3, 21]. Current approaches are all data hungry, and benefit from large annotated datasets for training. However, manually tracing object boundaries is a laborious process, taking up to 40 sec per object [2, 9]. To alleviate this problem, a number of interactive image segmentation techniques have been proposed [28, 23, 7, 2], speeding up annotation by a significant factor. We follow this line of work.
In DEXTR [23], the authors build upon the DeepLab architecture [8] by incorporating a simple encoding of human clicks in the form of heat maps. This is a pixel-wise approach, i.e., it predicts a foreground-background label for each pixel. DEXTR showed that by incorporating user clicks as a soft constraint, the model learns to interactively improve its prediction. Yet, since the approach is pixel-wise, the worst-case scenario still requires many clicks.
Polygon-RNN [7, 2] frames human-in-the-loop annotation as a recurrent process, during which the model sequentially predicts the vertices of a polygon. The annotator can intervene whenever an error occurs by correcting the wrong vertex. The model then continues its prediction by conditioning on the correction. Polygon-RNN was shown to produce annotations at a human level of agreement with only a few clicks per object instance. The worst-case scenario here is bounded by the number of polygon vertices, which for most objects ranges up to 30-40 points. However, the recurrent nature of the model limits its scalability to more complex shapes, resulting in harder training and longer inference. Furthermore, the annotator is expected to correct mistakes in sequential order, which is often challenging in practice.

∗Authors contributed equally.

Figure 1: We propose Curve-GCN for interactive object annotation. In contrast to Polygon-RNN [7, 2], our model parametrizes objects with either polygons or splines and is trained end-to-end at a high output resolution. (Image credit: https://richardkleincpa.com/new-york-city-street-wallpaper/)
In this paper, we frame object annotation as a regression problem, where the locations of all vertices are predicted simultaneously. We represent the object as a graph with a fixed topology, and perform prediction using a Graph Convolutional Network. We show how the model can be used and optimized for interactive annotation. Our framework further allows us to parametrize objects with either polygons or splines, adding additional flexibility and efficiency to the interactive annotation process. The proposed approach, which we refer to as Curve-GCN, is end-to-end differentiable, and runs in real time. We evaluate our Curve-GCN on the challenging Cityscapes dataset [10], where we outperform Polygon-RNN++ and PSP-DeepLab/DEXTR in both automatic and interactive settings. We also show that our model outperforms the baselines in cross-domain annotation, that is, a model trained on Cityscapes is used to

arXiv:1903.06874v1 [cs.CV] 16 Mar 2019
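The simultaneous-prediction idea above can be sketched as graph convolution over a contour with a fixed cycle topology: every control point aggregates features from its two neighbors on the contour and regresses a 2D coordinate offset, so all vertices move in parallel rather than being emitted one at a time. The snippet below is a minimal illustrative sketch; all names, weights, and shapes are our own assumptions, not the paper's actual architecture:

```python
# Sketch: one graph-convolution step on a polygon with fixed cycle topology.
# Each of the N control points carries a feature vector; its graph neighbors
# are the previous and next vertices on the contour.
import numpy as np

def gcn_layer(features, w_self, w_neigh):
    """Mix each vertex's features with its two contour neighbors, then ReLU."""
    left = np.roll(features, 1, axis=0)    # previous vertex on the contour
    right = np.roll(features, -1, axis=0)  # next vertex on the contour
    out = features @ w_self + 0.5 * (left + right) @ w_neigh
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
N, F = 40, 16                        # 40 control points, 16-dim features
feats = rng.normal(size=(N, F))      # stand-in for CNN features at each point
w1, w2 = rng.normal(size=(F, F)), rng.normal(size=(F, F))
w_out = rng.normal(size=(F, 2))      # final layer regresses a (dx, dy) offset

h = gcn_layer(feats, w1, w2)
offsets = h @ w_out                  # offsets for ALL vertices in one pass
print(offsets.shape)                 # (40, 2)
```

Because the topology is fixed, a correction to one vertex can simply be written into its features and the whole contour re-predicted in a single forward pass, which is what makes the interactive mode fast.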