深度卷积网络的弱监督与半监督学习

下载需积分: 9 | PDF格式 | 2.49MB | 更新于2024-09-08 | 171 浏览量 | 举报

"这篇论文探讨了在弱监督和半监督学习环境下深度卷积网络（Deep Convolutional Neural Networks, DCNNs）在语义图像分割中的应用。通过使用少量强标注数据和大量弱标注数据，如边界框或图像级标签，来训练DCNNs，以实现对语义图像分割任务的高效学习。文中提出了基于期望最大化（Expectation-Maximization, EM）的方法来优化在这种条件下的模型训练，并通过实验验证了这些技术在处理具有挑战性的PASCAL VOC 2012图像分割任务时，能够学习到的模型具有竞争力的结果。" 在这篇研究中，作者关注的是如何在有限或者不完全的标注数据情况下，有效地训练深度卷积神经网络进行语义图像分割。语义图像分割是一种计算机视觉任务，其目标是将图像分成多个类别，每个像素都被分配一个特定的语义标签。传统的深度学习方法通常需要大量的像素级标注数据，这在实际操作中往往非常耗时且昂贵。弱监督学习指的是利用较少或非精确的标注信息进行模型训练。在本文中，弱标注数据可以是边界框，它仅指示对象的大致位置，或者图像级标签，它只告诉模型图像中存在哪些对象，但不提供具体的位置信息。尽管这些信息不如像素级标注完整，但它们能覆盖更大的数据集，从而提供了更广泛的学习机会。半监督学习则结合了少量的强标注数据（像素级标注）和大量的弱标注数据。这种方法旨在通过利用未标注数据的潜在结构来增强模型的泛化能力。论文中提出的EM算法被用来在这些条件下优化模型训练。EM算法是一种迭代方法，通过交替估计模型参数（E步骤）和最大化似然性（M步骤）来逐步改进模型。实验部分，作者在PASCAL VOC 2012数据集上测试了他们的方法。这个数据集是图像分割任务的一个基准，包含了多种类别的物体，提供了各种挑战，如重叠对象、复杂背景等。结果显示，尽管在有限的强标注和大量的弱标注数据下训练，所提出的模型仍然能够在图像分割任务中取得与全监督学习相当的性能。这篇论文为在现实世界中解决语义图像分割问题提供了一种新的视角，尤其是在标注数据有限的情况下。它强调了弱监督和半监督学习在深度学习模型训练中的潜力，以及如何通过EM算法有效利用这些不完整的标注信息。这对于未来在计算机视觉领域的研究，特别是数据标注成本高昂的情况，有着重要的指导意义。

Algorithm 1 Weakly-Supervised EM (ﬁxed bias version)

Input: Initial CNN parameters θ

, potential parameters b

l ∈ {0, . . . , L}, image x, image-level label set z.

E-Step: For each image position m

(l) = f

(l|x; θ

) + b

, if z

= 1

(l) = f

(l|x; θ

), if z

= 0

3: ˆy

= argmax

(l)

M-Step:

4: Q(θ; θ

) = log P (

y|x, θ) =

m=1

log P (ˆy

|x, θ)

5: Compute ∇

Q(θ; θ

) and use SGD to update θ

have the following probabilistic graphical model:

P (x, y, z; θ) = P (x)

m=1

P (y

|x; θ)

P (z|y) . (3)

We pursue an EM-approach in order to learn the model

parameters θ from training data. If we ignore terms that do

not depend on θ, the expected complete-data log-likelihood

given the previous parameter estimate θ

Q(θ; θ

) =

P (y|x, z; θ

) log P (y|x; θ) ≈ log P (

y|x; θ) ,

(4)

where we adopt a hard-EM approximation, estimating in the

E-step of the algorithm the latent segmentation by

y = argmax

P (y|x; θ

)P (z|y) (5)

= argmax

log P (y|x; θ

) + log P (z|y) (6)

= argmax

m=1

|x; θ

) + log P (z|y)

.(7)

In the M-step of the algorithm, we optimize Q(θ; θ

) ≈

log P (

y|x; θ) by mini-batch SGD similarly to (1), treating

y as ground truth segmentation.

To completely identify the E-step (7), we need to specify

the observation model P (z|y). We have experimented with

two variants, EM-Fixed and EM-Adapt.

EM-Fixed In this variant, we assume that log P (z|y) fac-

torizes over pixel positions as

log P (z|y) =

m=1

φ(y

, z) + (const) , (8)

allowing us to estimate the E-step segmentation at each

pixel separately

ˆy

= argmax

)

= f

|x; θ

) + φ(y

, z) . (9)

Image

Image annotations

Score maps

Weakly-Supervised E-step

FG/BG

Bias

argmax

1. Cat

2. Person

3. Plant

4. Sofa

Deep Convolutional

Neural Network

Loss

Figure 2. DeepLab model training using image-level labels.

We assume that

φ(y

= l, z) =



if z

= 1

0 if z

= 0

(10)

We set the parameters b

= b

, if l > 0 and b

= b

with b

> b

> 0. Intuitively, this potential encourages a

pixel to be assigned to one of the image-level labels z. We

choose b

> b

, boosting present foreground classes more

than the background, to encourage full object coverage and

avoid a degenerate solution of all pixels being assigned to

background. The procedure is summarized in Algorithm 1

and illustrated in Fig. 2.

EM-Adapt In this method, we assume that log P (z|y) =

φ(y, z) + (const), where φ(y, z) takes the form of a cardi-

nality potential [23, 33, 36]. In particular, we encourage at

least a ρ

portion of the image area to be assigned to class

l, if z

= 1, and enforce that no pixel is assigned to class

l, if z

= 0. We set the parameters ρ

= ρ

, if l > 0 and

= ρ

. Similar constraints appear in [10, 20].

In practice, we employ a variant of Algorithm 1. We

adaptively set the image- and class-dependent biases b

as the prescribed proportion of the image area is assigned to

the background or foreground object classes. This acts as a

powerful constraint that explicitly prevents the background

score from prevailing in the whole image, also promoting

higher foreground object coverage. The detailed algorithm

is described in the supplementary material.

EM vs. MIL It is instructive to compare our EM-based

approach with two recent Multiple Instance Learning (MIL)

methods for learning semantic image segmentation models

[31, 32]. The method in [31] deﬁnes an MIL classiﬁcation

objective based on the per-class spatial maximum of the lo-

cal label distributions of (2),

P (l|x; θ)

= max

P (y

l|x; θ), and [32] adopts a softmax function. While this

approach has worked well for image classiﬁcation tasks

[29, 30], it is less suited for segmentation as it does not pro-

mote full object coverage: The DCNN becomes tuned to

focus on the most distinctive object parts (e.g., human face)

instead of capturing the whole object (e.g., human body).

剩余11页未读，继续阅读

林语微光

粉丝: 936

深度卷积网络的弱监督与半监督学习

CLAM:完整幻灯片图像上的高效数据和弱监督计算病理学

基于FlowS-Unet的遥感图像建筑物变化检测

clinical-grade-computational-pathology-using-weakly-supervised-deep-learning-on-whole-slide-images:首个临床等级的AI系统可用

matlab终止以下代码-NetVLAD-CNN-architecture-for-weakly-supervised-place-recog

DeepEM-for-Weakly-Supervised-Detection:MICCAI18 DeepEM

Deep Learning for Weakly-Supervised Object Detection and Object

Cross-Domain Weakly-Supervised Object Detectio.pdf

Deep Residual Learning for Weakly-Supervised Relation Extraction

Task-4-Large-scale-weakly-supervised-sound-event-detection-for-smart-cars:任务4：智能汽车的大型弱监督声音事件检测

星图识别matlab代码-Awesome-Weakly-Supervised-Temporal-Action-Localization:精选的

最新资源