Multi-Label Classification with Label Graph Superimposing
Ya Wang$∗, Dongliang He‡∗, Fu Li‡, Xiang Long‡, Zhichao Zhou‡, Jinwen Ma$†, Shilei Wen‡
$ School of Mathematical Sciences and LMAM, Peking University, China
‡ Department of Computer Vision Technology (VIS), Baidu Inc., Beijing, China
{wangyachn@, jwma@math}.pku.edu.cn {hedongliang01, lifu, longxiang, zhouzhichao01, wenshilei}@baidu.com
Abstract
Images and videos often contain multiple objects or actions. Multi-label
recognition has achieved strong performance thanks to the rapid development
of deep learning technologies. Recently, graph convolutional networks (GCNs)
have been leveraged to boost the performance of multi-label recognition.
However, the best way to model label correlations, and how feature learning
can be improved with awareness of the label system, remain unclear. In this
paper, we propose a label graph superimposing framework that improves the
conventional GCN+CNN framework developed for multi-label recognition in two
aspects. First, we model label correlations by superimposing a label graph
built from statistical co-occurrence information into a graph constructed
from knowledge priors of labels; multi-layer graph convolutions are then
applied to the final superimposed graph for label embedding abstraction.
Second, we propose to leverage the embedding of the whole label system for
better representation learning. Specifically, lateral connections between
the GCN and the CNN are added at shallow, middle, and deep layers to inject
information about the label system into the backbone CNN, making the feature
learning process label-aware. Extensive experiments on the MS-COCO and
Charades datasets show that the proposed solution greatly improves
recognition performance and achieves new state-of-the-art results.
Introduction
Multi-label is a natural property of images and videos: an image or video
usually contains multiple objects or actions. In the computer vision
community, multi-label recognition is a fundamental and practical task that
has attracted increasing research effort. Given the great success of
single-label image/video classification brought by deep convolutional
networks (He et al. 2015; Carreira and Zisserman 2017; He et al. 2016a;
Feichtenhofer et al. 2018; Wu et al. 2019), multi-label recognition can
achieve good performance by naively treating each label as independent and
applying multiple binary classifiers
∗ Equal contribution. This work was done when Ya Wang was a full-time research intern at Baidu.
† Corresponding author.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
[Figure 1 panels: (a) Examples on MS-COCO, with labels “Sports Ball” and “Sports Ball, Tennis Racket” (annotated value 0.42); (b) Examples on Charades, with labels “Sitting on Couch” and “Sitting on Couch, Watching Television” (annotated value 0.20).]
Figure 1: Examples of label relationships in multi-label datasets. (a)
illustrates the co-occurrence of “Sports Ball” and “Tennis Racket” on the
MS-COCO dataset: the frequency with which “Tennis Racket” co-occurs with
“Sports Ball” is as high as 0.42. Similarly, (b) showcases an example of
“Sitting on Couch” and “Watching Television” from the Charades dataset.
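The co-occurrence frequency quoted in the caption can be read as a conditional frequency estimated from label annotations. The following is a minimal sketch under that assumption; the function name `cooccurrence_frequency`, the toy annotations, and the choice of conditioning direction are illustrative and are not taken from the paper.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_frequency(samples):
    """Return freq[(a, b)] = (# samples with both a and b) / (# samples with a).

    Assumption: "frequency that b co-occurs with a" is interpreted as this
    conditional frequency; the paper may define it differently.
    """
    label_count = Counter()
    pair_count = Counter()
    for labels in samples:
        unique = set(labels)
        label_count.update(unique)
        # Ordered pairs (a, b) with a != b, so both directions are counted.
        pair_count.update(permutations(sorted(unique), 2))
    return {(a, b): n / label_count[a] for (a, b), n in pair_count.items()}


# Toy annotations (hypothetical, not taken from MS-COCO or Charades).
toy_samples = [
    {"sports ball", "tennis racket"},
    {"sports ball", "person"},
    {"sports ball", "tennis racket", "person"},
    {"person"},
]
freq = cooccurrence_frequency(toy_samples)
print(freq[("sports ball", "tennis racket")])
print(freq[("tennis racket", "sports ball")])
```

Note that the resulting table is generally asymmetric: in the toy data, every sample with a tennis racket also has a sports ball, but not the other way round.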
to predict whether each label is present or not. However, we argue that the
following two aspects should be taken into consideration for such a task.
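For reference, the naive baseline mentioned above, in which each label is decided independently, can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the helper names, the 0.5 threshold, and the example logits are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, threshold=0.5):
    """Decide each label independently: present if sigmoid(logit) >= threshold."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


def binary_cross_entropy(logits, targets, eps=1e-12):
    """Mean of per-label binary cross-entropy; labels do not interact."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Clamp the probability to avoid log(0).
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)


# Hypothetical logits for three labels produced by some backbone network.
logits = [2.0, -1.0, 0.3]
targets = [1, 0, 1]
print(predict_labels(logits))
print(binary_cross_entropy(logits, targets))
```

Because each term of the loss depends on a single label, nothing in this baseline can exploit relationships between labels.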
First of all, labels co-occur in images or videos according to priors. As
illustrated in Figure 1, “Sports Ball” is very likely to appear together
with “Tennis Racket”, and a man “Sitting on Couch” is often “Watching
Television” at the same time.
Then, a question is naturally raised, how to model the re-
arXiv:1911.09243v1 [cs.CV] 21 Nov 2019