
2.2. Weakly Supervised Object Detection
Most existing methods formulate weakly-supervised de-
tection as a multiple instance learning problem [1, 32, 13,
18, 22, 27]. These approaches divide the training images into
positive and negative sets, where each image is considered
as a bag of candidate object instances. If an image is an-
notated as a positive sample of a specific object class, at
least one proposal instance of the image belongs to this
class. The main task of MIL-based detectors is to learn
the discriminative representation of the object instances and
then select them from positive images to train a detector.
Previous works applying MIL to WSOD can be roughly
categorized into multi-phase learning approaches
[18, 4, 22, 38, 30, 42, 43, 41] and end-to-end learning
approaches [1, 39, 34, 19, 33].
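To illustrate the MIL constraint described above, the following minimal Python sketch treats an image as a bag of per-proposal scores for a single class. The function names, the example scores, and the use of max pooling to obtain the bag score are assumptions made for illustration; they are not taken from any of the cited methods.

```python
# Minimal MIL sketch: an image is a bag of proposal scores for one class.
# All names and numbers below are illustrative assumptions.

def bag_score(instance_scores):
    """Image-level score: a bag is as positive as its best instance."""
    return max(instance_scores)

def select_positive_instance(instance_scores):
    """Index of the proposal most likely to contain the object."""
    return max(range(len(instance_scores)), key=lambda i: instance_scores[i])

positive_bag = [0.10, 0.85, 0.30]  # image annotated as containing the class
negative_bag = [0.05, 0.20, 0.15]  # image annotated as not containing it

# In a positive bag at least one instance should score high;
# in a negative bag every instance should score low.
print(bag_score(positive_bag), select_positive_instance(positive_bag))
print(bag_score(negative_bag))
```

Training with only image-level labels then amounts to pushing the bag score up for positive images and down for negative ones, which is what leaves the choice of the positive instance latent.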
End-to-end learning approaches combine CNNs and
MIL into a unified network to address the weakly supervised
object detection task. Diba et al. [5] proposed an end-to-end
cascaded convolutional network that performs weakly
supervised object detection and segmentation in a cascaded
manner. Bilen et al. [1] developed a two-stream weakly supervised
deep detection network (WSDDN), which selects
the positive samples by aggregating the scores of the classification
stream and the detection stream. Based on WSDDN,
Kantorov et al. [19] proposed to learn a context-aware
CNN with contrast-based contextual modeling. Also based
on WSDDN, Tang et al. [34] designed an online instance
classifier refinement (OICR) algorithm to alleviate the lo-
cal optimum problem. Tang et al. [33] also proposed Pro-
posal Cluster Learning (PCL) to improve the performance
of OICR. Inspired by [19] and [5], Wei et
al. [39] proposed a tight box mining method that leverages
surrounding segmentation context derived from weakly
supervised segmentation to suppress low-quality distracting
candidates and boost high-quality ones. Recently, Tang
et al. [35] proposed a weakly supervised region proposal
network to generate more precise proposals for detection.
However, the selected positive instances often cover only the most
discriminative part of an object (e.g., the head of a cat) rather than
the whole object, which degrades the performance
of weakly supervised detectors.
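The two-stream score aggregation of WSDDN [1] mentioned above can be sketched as follows. This is a simplified illustration, assuming the usual formulation in which the classification stream applies a softmax over classes for each proposal, the detection stream applies a softmax over proposals for each class, and the two are multiplied element-wise; the function names and input layout (one row of logits per proposal) are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def two_stream_scores(cls_logits, det_logits):
    """Both inputs: one row per proposal, one column per class."""
    num_proposals = len(cls_logits)
    num_classes = len(cls_logits[0])
    # Classification stream: normalize over classes for each proposal.
    cls_probs = [softmax(row) for row in cls_logits]
    # Detection stream: normalize over proposals for each class.
    det_probs_by_class = [
        softmax([det_logits[r][c] for r in range(num_proposals)])
        for c in range(num_classes)
    ]
    # Proposal score: element-wise product of the two streams.
    proposal_scores = [
        [cls_probs[r][c] * det_probs_by_class[c][r] for c in range(num_classes)]
        for r in range(num_proposals)
    ]
    # Image-level score: sum of proposal scores for each class.
    image_scores = [
        sum(proposal_scores[r][c] for r in range(num_proposals))
        for c in range(num_classes)
    ]
    return proposal_scores, image_scores

cls_logits = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
det_logits = [[3.0, 0.0], [0.0, 3.0], [0.0, 0.0]]
proposal_scores, image_scores = two_stream_scores(cls_logits, det_logits)
print(image_scores)
```

Because the detection stream sums to one over proposals, each image-level score lies between 0 and 1 and can be trained directly against the image-level label.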
Multi-phase learning approaches first employ MIL to select
the best object candidate proposals, then use these selected
proposals as pseudo ground-truth (GT) annotations for training a
fully supervised object detector such as R-CNN [10] or
Fast(er) R-CNN [9, 26]. Li et al. [22] proposed classi-
fication adaptation to fine-tune the network to collect class
specific object proposals, and detection adaptation was used
to optimize the representations for the target domain by the
confident object candidates. Cinbis et al. [4] proposed a
multi-fold MIL detector by re-labeling proposals and re-
training the object classifier iteratively to prevent the detec-
tor from being locked into wrong object locations. Jie et al.
[18] proposed a self-taught learning approach to progres-
sively harvest high-quality positive instances. Zhang et al.
[43] proposed a pseudo ground-truth excavation (PGE) algorithm
and a pseudo ground-truth adaptation (PGA) algorithm
to refine the pseudo ground-truth obtained by [34]. Wan et
al. [38] proposed a min-entropy latent model (MELM) and
a recurrent learning algorithm for weakly supervised object
detection. Ge et al. [8] proposed to fuse and filter object instances
from different techniques and perform pixel labeling
with uncertainty; the resulting pixel-wise
labels are used to generate ground-truth bounding boxes for object
detection and attention maps for multi-label classification.
Zhang et al. [42] proposed a Multi-view Learning Localization
Network (ML-LocNet) by incorporating multi-view
learning into a two-phase WSOD model. However, multi-phase
learning for WSOD is a non-convex optimization problem,
so such approaches are prone to getting trapped in local optima.
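The first phase of the multi-phase pipeline described above, in which MIL scores are turned into pseudo ground-truth boxes, can be sketched as follows. This is a minimal illustration that keeps only the top-scoring proposal for each class present in the image; the function name, the box format (x1, y1, x2, y2), and the dictionary layout are assumptions, and the cited methods use more elaborate selection and refinement schemes.

```python
def mine_pseudo_ground_truth(proposals, scores, image_labels):
    """Pick the top-scoring proposal for each class present in the image.

    proposals:    list of boxes (x1, y1, x2, y2)
    scores:       scores[r][c] is the MIL score of proposal r for class c
    image_labels: class indices annotated as present in the image
    """
    pseudo_gt = []
    for c in image_labels:
        best = max(range(len(proposals)), key=lambda r: scores[r][c])
        pseudo_gt.append(
            {"box": proposals[best], "label": c, "score": scores[best][c]}
        )
    return pseudo_gt

proposals = [(10, 10, 50, 50), (30, 30, 120, 140), (0, 0, 200, 200)]
scores = [[0.70, 0.05], [0.20, 0.90], [0.10, 0.05]]
# Only class 1 is annotated for this image, so only one box is mined.
print(mine_pseudo_ground_truth(proposals, scores, image_labels=[1]))
```

The mined boxes would then be passed to a fully supervised detector as if they were annotations, so any error in this selection step propagates to the second phase.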
In this paper, we consider the MIL (positive object candidate
mining) and regression (object candidate localization
refinement) problems simultaneously. We follow the
MIL pipeline and combine the two-stream WSDDN [1]
and OICR/PCL algorithms [34, 33] to implement our basic
MIL branch and refine the detected boxes with a regression
branch in an online manner.
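As a rough illustration of how mined boxes could supervise a regression branch, the sketch below assigns regression targets to proposals that overlap a pseudo ground-truth box. It assumes the standard R-CNN box parameterization and an IoU threshold of 0.5; neither choice is stated in this section, and all function names are hypothetical.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def regression_targets(proposal, gt):
    """R-CNN style offsets (tx, ty, tw, th) from a proposal to a target box."""
    pw, ph = proposal[2] - proposal[0], proposal[3] - proposal[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    px, py = proposal[0] + 0.5 * pw, proposal[1] + 0.5 * ph
    gx, gy = gt[0] + 0.5 * gw, gt[1] + 0.5 * gh
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))

def assign_targets(proposals, pseudo_gt_box, iou_threshold=0.5):
    """Targets for proposals overlapping the pseudo GT box; None otherwise."""
    return [
        regression_targets(p, pseudo_gt_box)
        if iou(p, pseudo_gt_box) >= iou_threshold
        else None
        for p in proposals
    ]

pseudo_gt_box = (20.0, 20.0, 100.0, 100.0)
proposals = [(25.0, 25.0, 105.0, 95.0), (150.0, 150.0, 200.0, 200.0)]
print(assign_targets(proposals, pseudo_gt_box))
```

In an online setting the pseudo ground-truth box would be recomputed from the current MIL scores at every training iteration rather than fixed in advance.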
2.3. Attention Module
Attention modules were first used in the natural lan-
guage processing field and then introduced to the com-
puter vision area. Attention can be seen as a method
of biasing the allocation of available computational re-
sources towards the most informative components of a sig-
nal [15, 16, 25, 21, 37, 24, 14].
Current attention modules can be divided into two
categories: spatial attention and channel-wise attention.
Spatial attention assigns different weights to different
spatial regions depending on their feature content. It automatically
predicts a weighted heat map that enhances the
relevant features and suppresses the irrelevant ones during
the training process of a specific task. Spatial attention has
been used in image captioning [40], multi-label classification
[45], pose estimation [3], and other tasks. Hu et al. [14]
proposed a Squeeze-and-Excitation block, which models
channel-wise attention in a computationally efficient man-
ner. In this paper, we use a combination of spatial and
channel-wise attention, and our attention module is guided
by object category.
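The difference between the two kinds of attention can be made concrete with a small sketch on a feature map stored as nested lists indexed as features[channel][y][x]. This is a deliberately simplified illustration: the learned layers of a real attention module (for example, the two fully connected layers of a Squeeze-and-Excitation block) are replaced by fixed pooling and sigmoid operations, it does not model the category guidance of the proposed module, and all function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(features):
    """One weight per channel: global average pool, gate, then rescale."""
    gated = []
    for channel in features:
        values = [v for row in channel for v in row]
        gate = sigmoid(sum(values) / len(values))  # squeeze + (fixed) excitation
        gated.append([[v * gate for v in row] for row in channel])
    return gated

def spatial_attention(features):
    """One weight per location: gate on the mean response across channels."""
    num_channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    heat = [
        [
            sigmoid(sum(features[c][y][x] for c in range(num_channels)) / num_channels)
            for x in range(width)
        ]
        for y in range(height)
    ]
    weighted = [
        [[features[c][y][x] * heat[y][x] for x in range(width)] for y in range(height)]
        for c in range(num_channels)
    ]
    return weighted, heat

features = [
    [[2.0, 0.0], [0.0, 0.0]],  # channel 0
    [[4.0, 0.0], [0.0, 0.0]],  # channel 1
]
weighted, heat = spatial_attention(features)
print(heat)
print(channel_attention(features))
```

The spatial heat map has one entry per location and is shared by all channels, whereas the channel gate has one entry per channel and is shared by all locations; a combined module applies both.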
3. Method
In this section we introduce the proposed weakly supervised
object detection network, which consists of three major
components: a guided attention module (GAM), an MIL branch,
and a regression branch. The overall architecture of the proposed
network is shown in Figure 3. Given an input image, an en-