Semantic Edge Detection with Diverse Deep Supervision
Yun Liu¹, Ming-Ming Cheng¹, JiaWang Bian², Le Zhang³, Peng-Tao Jiang¹, Yang Cao¹
¹Nankai University  ²University of Adelaide  ³Advanced Digital Sciences Center
ABSTRACT
Semantic edge detection (SED), which aims at jointly extracting
edges as well as their category information, has far-reaching
applications in domains such as semantic segmentation, object proposal
generation, and object recognition. SED naturally requires achieving
two distinct supervision targets: locating fine detailed edges
and identifying high-level semantics. We shed light on how such
distracted supervision targets prevent state-of-the-art SED methods
from effectively using deep supervision to improve results. In
this paper, we propose a novel fully convolutional neural network
architecture using diverse deep supervision (DDS) within a multi-task
framework where lower layers aim at generating category-agnostic
edges, while higher layers are responsible for the detection of
category-aware semantic edges. To overcome the distracted
supervision challenge, a novel information converter unit is
introduced, whose effectiveness has been extensively evaluated on
several popular benchmark datasets, including SBD, Cityscapes,
and PASCAL VOC2012. Source code will be released upon paper
acceptance.
KEYWORDS
Semantic edge detection, diverse deep supervision
1 INTRODUCTION
Classical edge detection aims to detect edges and objects’ boundaries.
It is category-agnostic in the sense that recognizing object
categories is not necessary. It can be viewed as a pixel-wise binary
classification problem whose objective is to classify each pixel as
belonging to one class, indicating the edge, or to the other class,
indicating non-edge. In this paper we consider more practical scenarios
of semantic edge detection, in which the detection of edges and the
recognition of edges’ categories within an image are jointly achieved.
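The difference between the two label formats can be made concrete with a small sketch. It assumes that a category-agnostic label is one binary map per image and that a semantic label is one binary map per category, so a pixel may carry several categories at once (as the Person+Motorbike color code in Figure 1 suggests). The type aliases, the helper names `to_category_agnostic` and `categories_at`, and the toy data are illustrative only and do not come from the paper.

```python
from typing import List

BinaryMap = List[List[int]]    # H x W, 1 = edge, 0 = non-edge
SemanticMap = List[BinaryMap]  # K x H x W, one binary map per category


def to_category_agnostic(semantic: SemanticMap) -> BinaryMap:
    """Collapse per-category edge maps into a single binary edge map.

    A pixel counts as an edge if it is an edge of at least one category.
    """
    height, width = len(semantic[0]), len(semantic[0][0])
    return [
        [int(any(cat[y][x] for cat in semantic)) for x in range(width)]
        for y in range(height)
    ]


def categories_at(semantic: SemanticMap, y: int, x: int) -> List[int]:
    """Return the indices of all categories whose edge passes through (y, x)."""
    return [k for k, cat in enumerate(semantic) if cat[y][x]]


# Toy 2 x 3 image with two hypothetical categories (0 = person, 1 = motorbike).
person = [[1, 1, 0],
          [0, 0, 0]]
motorbike = [[0, 1, 1],
             [0, 0, 0]]
semantic_edges = [person, motorbike]

binary_edges = to_category_agnostic(semantic_edges)
shared = categories_at(semantic_edges, 0, 1)  # pixel on both edges
```

Nested lists stand in for image arrays here to keep the sketch dependency-free; an array library would be the natural choice for real images.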
Semantic edge detection (SED) [4, 14, 30, 41] is an active research
topic in computer vision due to its wide-ranging applications in
problems such as object proposal generation [4], occlusion and
depth reasoning [1, 17], 3D reconstruction [33], object detection
[11, 12], image-based localization [32] and so on.
Recently, deep convolutional neural networks (DCNNs) have become
the de facto method for category-agnostic edge detection [29, 38],
where near human-level performance has been achieved. Deep learning
for category-aware SED, which jointly detects visually salient edges
and recognizes their categories, has however yet to witness such
popularity. Hariharan et al. [14] first combined generic object
detectors with bottom-up edges to recognize semantic edges. A fully
convolutional encoder-decoder network is proposed in [39] to detect
object contours, but without recognizing specific categories.
Recently, CASENet [41] introduced a skip-layer structure to enrich
category-wise edge activations with bottom layer features, improving
on previous state-of-the-art methods by a significant margin.
[Figure 1 panels: (a) original image; (b) ground truth; (c) color codes
(Person, Motorbike, Person+Motorbike); (d) Side-1; (e) Side-2; (f) Side-3;
(g) Side-4; (h) Side-5; (i) DDS]
Figure 1: An example of our DDS algorithm. (a) shows the
original image from the SBD dataset. (b)-(c) show its semantic
edge map and corresponding color codes. (d)-(g) display
category-agnostic edges from Side-1-4. (h)-(i) show semantic
edges of Side-5 and DDS output, respectively.
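The supervision layout in Figure 1 (category-agnostic targets for Side-1 to Side-4, semantic targets for Side-5 and the DDS output) can be sketched as a combined training loss. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses plain per-pixel binary cross-entropy, weights all side outputs equally, and operates on probabilities rather than network activations. All function names and the toy data are hypothetical.

```python
import math
from typing import List

Map2D = List[List[float]]


def bce(prob: float, target: int, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one pixel, clamped for numerical safety."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


def binary_edge_loss(pred: Map2D, target: List[List[int]]) -> float:
    """Mean per-pixel loss between an H x W probability map and a 0/1 edge map."""
    losses = [bce(p, t)
              for pred_row, target_row in zip(pred, target)
              for p, t in zip(pred_row, target_row)]
    return sum(losses) / len(losses)


def semantic_edge_loss(pred: List[Map2D], target: List[List[List[int]]]) -> float:
    """Multi-label loss: each of the K category maps is scored independently."""
    return sum(binary_edge_loss(p, t) for p, t in zip(pred, target)) / len(pred)


def diverse_supervision_loss(low_sides, top_side, fused, semantic_target) -> float:
    """Binary targets for the low sides, semantic targets for the top and fused maps."""
    height, width = len(semantic_target[0]), len(semantic_target[0][0])
    # Category-agnostic target: a pixel is an edge if any category marks it.
    agnostic_target = [[int(any(cat[y][x] for cat in semantic_target))
                        for x in range(width)] for y in range(height)]
    loss = sum(binary_edge_loss(side, agnostic_target) for side in low_sides)
    loss += semantic_edge_loss(top_side, semantic_target)
    loss += semantic_edge_loss(fused, semantic_target)
    return loss


# Toy 1 x 3 image with two hypothetical categories.
semantic_target = [[[1, 1, 0]], [[0, 1, 1]]]
uninformed_map = [[0.5, 0.5, 0.5]]
low_sides = [uninformed_map] * 4             # stand-ins for Side-1 to Side-4
top_side = [uninformed_map, uninformed_map]  # stand-in for Side-5 (K = 2 maps)
fused = [uninformed_map, uninformed_map]     # stand-in for the final DDS output

total = diverse_supervision_loss(low_sides, top_side, fused, semantic_target)
```

The point of the sketch is only that the low sides and the top side are compared against different targets derived from the same annotation; the choice of loss function and weighting is left open here.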
Distracted supervision paradox in SED.
SED naturally requires achieving two distinct supervision targets:
i) locating fine detailed edges by capturing discontinuity among image
regions, mainly using low-level features; and ii) identifying
abstracted high-level semantics by summarizing different appearance
variations of the target categories. This distracted supervision
paradox prevents the state-of-the-art SED method, i.e. CASENet [41],
from successfully applying deep supervision, whose effectiveness has
been demonstrated in a wide range of other computer vision tasks,
e.g. image categorization [36], object detection [26], visual
tracking [37], and category-agnostic edge detection [29, 38].
In this paper, we propose a diverse deep supervision (DDS) method,
which employs deep supervision with different loss functions for
high-level and low-level feature learning, as shown in Fig. 2(b).
While mainly using high-level convolution (i.e. conv) features for
semantic classification and low-level conv ones for non-semantic
edge details is intuitive and straightforward, directly doing this as in
CASENet [41] results in even worse performance than directly learning
semantic edges without deep supervision or category-agnostic
edge guidance. In [41], Yu et al. claimed that deep supervision for
lower layers of the network is not necessary, after unsuccessfully
trying various ways of adding deep supervision. As illustrated in
Fig. 2(b), we propose an information converter unit for changing
the backbone DCNN features into different representations, for
training category-agnostic or semantic edges respectively. Without
arXiv:1804.02864v1 [cs.CV] 9 Apr 2018