End-to-End Weakly Supervised Object Detection: Overcoming Local Minima and Achieving Precise Localization


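This paper addresses the challenges of weakly supervised object detection and proposes a new method for it. In weakly supervised object detection, the training data carry no instance-level annotations for individual objects, which makes it difficult to predict object locations precisely. The conventional strategy is a two-phase learning procedure: Multiple Instance Learning (MIL) first identifies candidate regions that contain the target objects, and a second phase then refines these candidates using fully supervised learning with bounding box regression. Such two-phase methods have a weakness: in moving from the MIL phase to the fully supervised phase, they may get trapped in a local optimum, especially for certain object categories.

To address this, the paper proposes an end-to-end learning strategy. The authors design a single network that contains a multiple instance learning branch and a bounding box regression branch, both sharing the same backbone so that the two tasks can be optimized jointly. Integrating the two tasks is intended to improve overall performance and reduce the risk of local optima. To strengthen feature extraction, the authors also introduce a guided attention module (GAM), which uses the classification loss as guidance to extract the location information implicit in the features. The network therefore optimizes for localization cues as well as for classification, which improves detection accuracy.

According to the reported experiments, the method clearly outperforms conventional approaches on public datasets, which supports its effectiveness for precise object localization under weak supervision. The key contribution is an end-to-end weakly supervised object detection network that combines multiple instance learning, bounding box regression, and guided attention to mitigate the local optima caused by the lack of instance-level annotations.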

2.2. Weakly Supervised Object Detection

Most existing methods formulate weakly-supervised detection as a multiple instance learning problem [1, 32, 13, 18, 22, 27]. These approaches divide training images into positive and negative parts, where each image is considered as a bag of candidate object instances. If an image is annotated as a positive sample of a specific object class, at least one proposal instance of the image belongs to this class. The main task of MIL-based detectors is to learn the discriminative representation of the object instances and then select them from positive images to train a detector. Previous works on applying MIL to WSOD can be roughly categorized into multi-phase learning approaches [18, 4, 22, 38, 30, 42, 43, 41] and end-to-end learning approaches [1, 39, 34, 19, 33].
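The bag constraint described above can be illustrated with a minimal sketch. It assumes that per-proposal class probabilities are already available and aggregates them with a max, which is only one common choice and not necessarily what any of the cited methods use; the names `bag_score` and `bag_loss` and the example scores are hypothetical.

```python
import math

def bag_score(instance_scores):
    """Image-level (bag) score: the bag is as positive as its best instance."""
    return max(instance_scores)

def bag_loss(instance_scores, image_label):
    """Binary cross-entropy between the bag score and the image-level label."""
    eps = 1e-7  # keeps log() away from zero
    p = min(max(bag_score(instance_scores), eps), 1.0 - eps)
    return -(image_label * math.log(p) + (1 - image_label) * math.log(1.0 - p))

# Hypothetical per-proposal probabilities for one class in one image.
proposal_scores = [0.05, 0.10, 0.90, 0.20]
positive_loss = bag_loss(proposal_scores, 1)  # image labeled as containing the class
negative_loss = bag_loss(proposal_scores, 0)  # same scores, image labeled as not containing it
```

Only the image-level label enters the loss, so the supervision never states which proposal is the object; the detector has to discover that on its own.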

End-to-end learning approaches combine CNNs and MIL into a unified network to address the weakly supervised object detection task. Diba et al. [5] proposed an end-to-end cascaded convolutional network to perform weakly supervised object detection and segmentation in a cascaded manner. Bilen et al. [1] developed a two-stream weakly supervised deep detection network (WSDDN), which selects the positive samples by aggregating the scores of the classification stream and the detection stream. Based on WSDDN, Kantorov et al. [19] proposed to learn a context-aware CNN with contrast-based contextual modeling. Also based on WSDDN, Tang et al. [34] designed an online instance classifier refinement (OICR) algorithm to alleviate the local optimum problem. Tang et al. [33] also proposed Proposal Cluster Learning (PCL) to improve the performance of OICR. Following the inspiration of [19] and [5], Wei et al. [39] proposed a tight box mining method that leverages surrounding segmentation context derived from weakly-supervised segmentation to suppress low-quality distracting candidates and boost the high-quality ones. Recently, Tang et al. [35] proposed a weakly supervised region proposal network to generate more precise proposals for detection. Positive object instances often focus on the most discriminative parts of an object (e.g., the head of a cat) but not the whole object, which leads to inferior performance of weakly supervised detectors.
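The two-stream aggregation attributed to WSDDN [1] can be sketched as follows. The sketch assumes the commonly described formulation (softmax over classes in the classification stream, softmax over proposals in the detection stream, element-wise product, sum over proposals), which the text above does not spell out; the function names and the toy logits are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def two_stream_scores(cls_logits, det_logits):
    """Both inputs are R x C nested lists (R proposals, C classes)."""
    num_props = len(cls_logits)
    num_classes = len(cls_logits[0])
    # Classification stream: normalize over classes for each proposal.
    cls = [softmax(row) for row in cls_logits]
    # Detection stream: normalize over proposals for each class (det[c][r]).
    det = [softmax([det_logits[r][c] for r in range(num_props)])
           for c in range(num_classes)]
    # The element-wise product gives one score per proposal and class.
    proposal_scores = [[cls[r][c] * det[c][r] for c in range(num_classes)]
                       for r in range(num_props)]
    # Summing over proposals gives an image-level score per class.
    image_scores = [sum(proposal_scores[r][c] for r in range(num_props))
                    for c in range(num_classes)]
    return proposal_scores, image_scores

# Toy logits: 3 proposals, 2 classes (assumed values for illustration).
cls_logits = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
det_logits = [[3.0, 0.0], [0.0, 3.0], [0.0, 0.0]]
proposal_scores, image_scores = two_stream_scores(cls_logits, det_logits)
best_for_class0 = max(range(3), key=lambda r: proposal_scores[r][0])
```

Because the detection stream sums to one over proposals, each image-level score stays in [0, 1] and can be trained against the image-level label directly.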

Multi-phase learning approaches first employ MIL to select the best object candidate proposals, then use these selected proposals as pseudo GT annotations for learning a fully supervised object detector such as R-CNN [10] or Fast(er) R-CNN [9, 26]. Li et al. [22] proposed classification adaptation to fine-tune the network to collect class-specific object proposals, and detection adaptation was used to optimize the representations for the target domain by the confident object candidates. Cinbis et al. [4] proposed a multi-fold MIL detector by re-labeling proposals and re-training the object classifier iteratively to prevent the detector from being locked into wrong object locations. Jie et al. [18] proposed a self-taught learning approach to progressively harvest high-quality positive instances. Zhang et al. [43] proposed a pseudo ground-truth excavation (PGE) algorithm and a pseudo ground-truth adaptation (PGA) algorithm to refine the pseudo ground-truth obtained by [34]. Wan et al. [38] proposed a min-entropy latent model (MELM) and a recurrent learning algorithm for weakly supervised object detection. Ge et al. [8] proposed to fuse and filter object instances from different techniques and perform pixel labeling with uncertainty; they used the resulting pixelwise labels to generate ground-truth bounding boxes for object detection and attention maps for multi-label classification. Zhang et al. [42] proposed a Multi-view Learning Localization Network (ML-LocNet) by incorporating multi-view learning into a two-phase WSOD model. However, multi-phase learning WSOD is a non-convex optimization problem, which makes such approaches prone to getting trapped in local optima.
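The first phase of such a pipeline, turning MIL scores into pseudo GT annotations, can be sketched in a few lines. This is a simplified illustration that keeps only the single top-scoring proposal per labeled class; the cited methods use more elaborate selection and refinement, and `mine_pseudo_gt` and the toy inputs are hypothetical.

```python
def mine_pseudo_gt(boxes, scores, image_labels):
    """Pick the top-scoring proposal of each labeled class as pseudo ground truth.

    boxes:        list of (x1, y1, x2, y2) proposals
    scores:       R x C nested list of MIL proposal scores
    image_labels: length-C list of 0/1 image-level labels
    """
    pseudo_gt = []
    for class_id, present in enumerate(image_labels):
        if not present:
            continue  # no pseudo box for classes absent from the image
        best = max(range(len(boxes)), key=lambda r: scores[r][class_id])
        pseudo_gt.append((boxes[best], class_id))
    return pseudo_gt

# Toy example: 3 proposals, 3 classes, image labeled with classes 0 and 2.
boxes = [(0, 0, 50, 50), (10, 10, 90, 90), (40, 40, 60, 60)]
scores = [[0.1, 0.6, 0.2],
          [0.7, 0.3, 0.1],
          [0.2, 0.1, 0.5]]
pseudo_gt = mine_pseudo_gt(boxes, scores, [1, 0, 1])
```

A second phase would then train a fully supervised detector on `pseudo_gt`. If the selected box covers only a discriminative part, that error is frozen into the training targets, which is one way to read the local optima problem noted above.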

In this paper, we consider the MIL (positive object candidates mining) and regression (object candidates localization refinement) problems simultaneously. We follow the MIL pipeline and combine the two-stream WSDDN [1] and OICR/PCL algorithms [34, 33] to implement our basic MIL branch and refine the detected boxes with a regression branch in an online manner.
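To make the role of a regression branch concrete, the sketch below encodes and decodes box offsets. It assumes the standard R-CNN parameterization (center shifts scaled by proposal size, log-scale width and height ratios); this excerpt does not state which parameterization the regression branch uses, and the helper names are hypothetical.

```python
import math

def to_center(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + w / 2.0, y1 + h / 2.0, w, h

def encode(proposal, target):
    """Regression targets that move `proposal` onto `target`."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(target)
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def decode(proposal, offsets):
    """Apply predicted offsets to a proposal to obtain the refined box."""
    px, py, pw, ph = to_center(proposal)
    tx, ty, tw, th = offsets
    cx, cy = px + tx * pw, py + ty * ph
    w, h = pw * math.exp(tw), ph * math.exp(th)
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)

# Assumed example: a proposal and the (pseudo) ground-truth box it should regress to.
proposal = (10.0, 10.0, 50.0, 90.0)
pseudo_box = (20.0, 15.0, 70.0, 85.0)
offsets = encode(proposal, pseudo_box)
refined = decode(proposal, offsets)
```

In an online setting, the target box would come from the MIL branch during the same training iteration rather than from a separately trained first phase.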

2.3. Attention Module

Attention modules were first used in the natural language processing field and then introduced to the computer vision area. Attention can be seen as a method of biasing the allocation of available computational resources towards the most informative components of a signal [15, 16, 25, 21, 37, 24, 14].

Current attention modules can be divided into two categories: spatial attention and channel-wise attention. Spatial attention assigns different weights to different spatial regions depending on their feature content. It automatically predicts a weighted heat map to enhance the relevant features and suppress the irrelevant features during the training process of a specific task. Spatial attention has been used in image captioning [40], multi-label classification [45], pose estimation [3] and so on. Hu et al. [14] proposed a Squeeze-and-Excitation block which models channel-wise attention in a computationally efficient manner. In this paper, we use a combination of spatial and channel-wise attention, and our attention module is guided by object category.
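The two attention types can be contrasted on a tiny feature map. The sketch below is a generic illustration, not the guided attention module of this paper: the channel gate follows the Squeeze-and-Excitation pattern (global average pooling, a bottleneck with ReLU, a sigmoid), the spatial gate is an assumed 1x1 convolution followed by a sigmoid, and all weights are made-up constants rather than learned parameters.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def channel_gate(feat, w1, w2):
    """SE-style gate: one weight per channel. feat is C x H x W."""
    # Squeeze: global average pooling per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]
    # Excitation: bottleneck layer with ReLU, then expansion with sigmoid.
    hidden = [max(0.0, sum(w * z for w, z in zip(row, squeezed))) for row in w1]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]

def spatial_gate(feat, w):
    """One weight per location: 1x1 convolution over channels, then sigmoid."""
    height, width = len(feat[0]), len(feat[0][0])
    return [[sigmoid(sum(w[c] * feat[c][y][x] for c in range(len(feat))))
             for x in range(width)] for y in range(height)]

def apply_attention(feat, w1, w2, w_spatial):
    """Rescale every activation by its channel gate and its spatial gate."""
    cg = channel_gate(feat, w1, w2)
    sg = spatial_gate(feat, w_spatial)
    return [[[feat[c][y][x] * cg[c] * sg[y][x]
              for x in range(len(feat[0][0]))]
             for y in range(len(feat[0]))]
            for c in range(len(feat))]

# Assumed toy input: 2 channels, 2 x 2 spatial grid, bottleneck of size 1.
feat = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 0.0], [0.0, 1.0]]]
w1 = [[1.0, -1.0]]      # C_reduced x C
w2 = [[1.0], [-1.0]]    # C x C_reduced
gates = channel_gate(feat, w1, w2)
attended = apply_attention(feat, w1, w2, [0.5, 0.5])
```

Both gates lie in (0, 1), so the output is a re-weighted copy of the input with the same shape. Guiding such a module by object category would require an additional classification loss on the attention output, which is not shown here.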

3. Method

In this section we introduce the proposed weakly supervised object detection network, which consists of three major components: guided attention module (GAM), MIL branch and regression branch. The overall architecture of the proposed network is shown in Figure 3. Given an input image, an en-
