Robust salient object detection for RGB images
features for salient object detection. Zhang et al. [58]pro-
pose a controlled bi-directional passing of features between
shallow and deep layers to obtain accurate predictions. Deng
et al. [9] develop a recurrent residual refinement network
for saliency maps refinement by incorporating shallow and
deep layers features alternately. Qin et al. [36] propose a
end-to-end predict-refine architecture BASNet. It is able to
capture both large-scale and fine structures, e.g., thin regions,
holes, and produce salient object detection maps with clear
boundaries. Wu et al. [52] propose a cascaded partial decoder
framework, which discards low-level features to reduce the
complexity of deep aggregation models and utilizes gen-
erated relatively precise attention map to refine high-level
features to improve the performance. Li et al. [27]employ
a multi-scale cascade structure and a refinement module to
filter out errors. It better consolidates contextual information
and intermediate saliency priors.
Although aforementioned approaches employ powerful
CNNs and make remarkable success in salient object detec-
tion, they produce unsatisfactory results in dealing with
non-salient images problems as shown in Fig. 1. As a result,
there is still a large room for performance improvements.
2.2 Salient object existence prediction
Wang et al. [45] exploit hand-crafted global features from
multiple saliency information to directly predict the exis-
tence and the position of the salient object in web images by
random forest. The purpose of this work is different from ours
and focuses more on location of salient object whose result
is expressed by bounding box enclosing the salient object
region. Zhang et al. [56] investigate not only existence but
also counting the number of salient objects based on holistic
cues. If the image contains no salient objects, then an all-
black saliency map is generated directly while salient object
detection is not performed. Jiang et al. [19] propose a super-
vised learning approach for jointly addressing the salient
object detection and existence prediction problems by the
structural SVM framework to predict both image-level exis-
tence labels and pixel-level saliency values. Hou et al. [17]
predict the saliency existence of the input image by intro-
ducing another branch into salient object detection network
with short connections. It does not consider jointly training
of both tasks, but only improve the accuracy of salient object
existence prediction based on salient object detection net-
work.
In this paper, we focus on recognizing saliency existence
and locating salient objects. By incorporating image-level
label, better performance of salient object detection with
pixel level can be achieved.
2.3 Multi-task models
Salient object detection is to identify the most visually dis-
tinctive objects or regions in an image and then segment them
out from the background. Semantic segmentation, image
classification, salient object contour detection, subitizing and
salient object existence prediction are discussed to guide
salient object detection in recent years. Li et al. [26]set
up a multi-mask learning scheme for exploring the intrinsic
correlations between saliency detection and image seman-
tic segmentation. Cholakkal et al. [8] propose a framework
for top-down salient object detection that incorporates a
tightly coupled image classification module. Wang et al. [43]
propose to use image-level tags as weak supervision to
learn to predict pixel-level saliency maps solving the prob-
lem that train DNNs require costly pixel-level annotations.
Li et al. [23] propose to use the combination of a coarse
salient object activation map from the classification network
and saliency maps generated from unsupervised methods as
pixel-level annotation, to train fully convolutional networks
for salient object detection supervised by these noisy annota-
tions. Wang et al. [22] design a deep multi-scale refinement
network for both salient region detection and salient object
contour detection. Zhuge et al. [61] propose a fully con-
volutional networks to integrate multi-level convolutional
features recurrently with the guidance of object boundary
information. He et al. [14] detect salient objects with the aid
of subitizing. Jiang et al. [19] propose t o jointly train the
salient object detection and existence prediction problems
by the structural SVM framework. Li et al. [28] graft salient
object detection decoder onto the existing contour detection
network to form a multi-task network architecture without
using any manually labeled salient object masks. Although
it implements joint training of two tasks, both tasks belong to
the pixel-level segmentation task, lack of more multi-modal
information. Hou et al. [16] aim at solving pixel-wise binary
problems, including salient object detection, skeleton extrac-
tion and edge detection, by introducing a horizontal cascade
encoder architecture. But this general structure cannot han-
dle multiple tasks at the same time, and does not consider the
complementarity between multiple tasks. Wang et al. [50]
design a neural network that has two branches for attention
box prediction (ABP) and aesthetics assessment (AA) to crop
photograph with the best aesthetic quality. ABP subnetwork
is responsible for inferring the initial cropping, and the AA
network determines the final cropping. ABP task is followed
by AA task. These two tasks are not learned simultaneously.
They only share several convolutional blocks in the bottom
of network.
CNN model trained in end-to-end manner for both salient
object detection and salient object existence prediction tasks
is unexplored in aforementioned literature.
123