Weakly Supervised Complementary Parts Models for Fine-Grained Image Classification from the Bottom Up

Weifeng Ge1,2∗ Xiangru Lin2∗ Yizhou Yu1†
1 Deepwise AI Lab  2 The University of Hong Kong

Abstract

Given training data that consists of images and their corresponding class labels, deep convolutional neural networks have shown a strong ability to mine discriminative parts for image classification. However, deep CNNs trained with image-level labels only tend to focus on the most discriminative parts while ignoring other object parts that could provide complementary information. In this paper, we approach this problem from a different perspective. We build complementary parts models in a weakly supervised manner to retrieve the information suppressed by the dominant object parts detected by convolutional neural networks. With image-level labels only, we first extract coarse object instances by performing weakly supervised object detection and instance segmentation using Mask R-CNN and CRF-based segmentation. We then estimate and search for the best parts model for each object instance under the principle of preserving as much diversity as possible. In the final stage, we construct a bi-directional long short-term memory (LSTM) network to fuse and encode the partial information of these complementary parts into a comprehensive feature for image classification. Experimental results indicate that the proposed method not only achieves significant improvements over our baseline models, but also outperforms state-of-the-art algorithms by a large margin (6.7%, 2.8%, 5.2% respectively) on Stanford Dogs 120, Caltech-UCSD Birds 2011-200 and Caltech 256.

1. Introduction

Deep neural networks have demonstrated their ability to learn representative features for image classification [34, 25, 37, 17]. Given training data, image classification [9, 25] usually constructs a feature extractor that accepts an input image and a subsequent classifier that generates prediction probabilities for the image. This is a common pipeline in many high-level vision tasks, such as object detection [14, 16], tracking [42, 33, 38] and scene understanding [8, 31]. While models trained with the above pipeline can achieve competitive results on many image classification benchmarks, their performance gain mainly comes from the models' capability of discovering the most discriminative parts in input images. To better understand trained deep neural networks and gain insight into this phenomenon, many techniques [1, 54, 2] have been proposed to visualize the intermediate results of deep networks. As Figure 1 shows, deep convolutional neural networks trained with image labels only tend to focus on the most discriminative parts while missing other object parts. However, focusing on the most discriminative parts alone can be limiting. Some image classification tasks need object descriptions that are as complete as possible. A complete object description does not have to be holistic; it can be assembled from multiple part descriptions. To remove redundancy, these part descriptions should complement each other. Image classification tasks that can benefit from such complete descriptions include the fine-grained classification tasks on Stanford Dogs 120 [21] and CUB 2011-200 [47], where the appearances of different object parts jointly contribute to the final classification performance. Following the above analysis, we approach image classification from a different perspective and propose a new pipeline that mines complementary parts instead of the aforementioned most discriminative parts, and fuses the mined complementary parts before making the final classification decision.

∗ These authors contributed equally. † The corresponding author is Yizhou Yu.

Figure 1. Visualization of class activation map (CAM [54]) and weakly supervised object detections. (Panels: (a) Input, (b) CAM, (c) Detections.)

Object Detection Phase. Object detection [10, 14, 16] is able to localize objects by performing a huge number of classifications at a large number of locations. In Figure 1, red bounding boxes are groundtruth annotations, green bounding boxes are positive object proposals, and blue bounding boxes are negative proposals. The difference between positive and negative samples lies in whether they contain sufficient information (overlap ratio with the groundtruth bounding box) to describe the objects. If we look at the activation maps in Figure 1, it is obvious that the distribution of positive bounding boxes is much wider than the core regions. Thus, we hypothesize that the positive object proposals that lie around the core regions can be helpful for image classification since they contain partial information of the objects in the image. However, the challenges in improving image classification by detection are two-fold. First, how can we perform object detection without groundtruth bounding box annotations? Second, how can we exploit object detection results to boost the performance of image classification?
In this paper, we attempt to tackle these two challenges in a weakly supervised manner.

To avoid missing any important object parts, we propose a weakly supervised object detection pipeline regularized by iterative object instance segmentation. We start by training a deep classification neural network that produces a class activation map (CAM) as in [54]. The activations in the CAM are then taken as the pixelwise probabilities of the corresponding class. A conditional random field (CRF) [40] then incorporates low-level pairwise appearance information to perform unsupervised object instance segmentation. To refine object locations and pixel labels, a Mask R-CNN [16] is trained using the object instance masks from the CRF. Results from the Mask R-CNN are used as a pixel probability map to replace the CAM in the CRF. We alternate Mask R-CNN and CRF regularization a few times to generate the final object instance masks.

Image Classification Phase. Directly reporting classification results in the object detection phase gives rise to inferior performance because object detection algorithms make much effort to determine location in addition to class labels. In order to mine representative object parts with the help of object detection, we utilize the proposals generated in the previous object detection phase and build a complementary parts model, which consists of a subset of the proposals that cover as much complementary object information as possible. At the end, we exploit a bi-directional long short-term memory network to encode the deep features of the object parts for final image classification.

In summary, this paper has the following contributions:

∙ We introduce a new representation for image classification, called the weakly supervised complementary parts model, that attempts to grasp complete object descriptions using a selected subset of object proposals.
It is an important step forward in exploiting weakly supervised detection to boost image classification performance.

∙ We develop a novel pipeline for weakly supervised object detection and instance segmentation. Specifically, we iterate the following two steps: object detection and segmentation using Mask R-CNN, and instance segmentation enhancement using a CRF. In this way, we obtain strong object detection results and build accurate object part models.

∙ To encode complementary information in different object parts, we exploit a bi-directional long short-term memory network to make the final classification decision. Experimental results demonstrate that we achieve state-of-the-art performance on multiple image classification tasks, including fine-grained classification on Stanford Dogs 120 [21] and Caltech-UCSD Birds 200-2011 [47], and generic classification on Caltech 256 [15].

2. Related Work

Weakly Supervised Object Detection and Segmentation. Weakly supervised object detection and segmentation respectively locate and segment objects with image labels only [5]. In [7, 6], object detection is solved as a classification problem via specific pooling layers in CNNs. The method in [44] proposed an iterative bottom-up and top-down framework to expand object regions and optimize the segmentation network iteratively. Ge et al. [12] progressively mine object locations and pixel labels with the filtering and fusion of multiple evidences. In contrast, we perform weakly supervised object instance detection and segmentation by feeding coarse segmentation masks and proposals obtained from CAM [54] to Mask R-CNN [16], and rectifying the object locations and masks with a CRF [40] iteratively.
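The alternation between CRF regularization and Mask R-CNN retraining described above can be sketched as follows. This is a minimal illustration of the control flow only: `crf_refine` and `mask_rcnn_predict` are toy stand-ins (a hard arg-max labeling and a label-softening step) for the real CRF inference [40] and Mask R-CNN [16], which this sketch does not implement.

```python
import numpy as np

def crf_refine(prob_map):
    """Toy stand-in for CRF inference [40]: a hard arg-max labeling.
    A real CRF would also use low-level pairwise appearance terms."""
    return np.argmax(prob_map, axis=0)

def mask_rcnn_predict(label_map, num_classes):
    """Toy stand-in for Mask R-CNN [16] trained on the pseudo ground truth:
    returns a per-class probability map derived from the label map."""
    h, w = label_map.shape
    probs = np.full((num_classes + 1, h, w), 0.1)
    for c in range(num_classes + 1):
        probs[c][label_map == c] = 0.9
    return probs / probs.sum(axis=0, keepdims=True)

def iterative_refinement(cam_probs, num_classes, iters=3):
    """Alternate CRF regularization and Mask R-CNN retraining, starting
    from CAM-derived probabilities, and return the final instance masks."""
    probs = cam_probs
    for _ in range(iters):
        pseudo_labels = crf_refine(probs)      # pseudo ground-truth masks
        probs = mask_rcnn_predict(pseudo_labels, num_classes)
    return crf_refine(probs)
```

In the real pipeline each `mask_rcnn_predict` call corresponds to a full training round of Mask R-CNN on the pseudo annotations, so the loop is run only a few times.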
In this way, we avoid losing important object parts for subsequent object parts modeling.

Part Based Fine-grained Image Classification. Learning a diverse collection of discriminative parts in a supervised [51, 50] or unsupervised manner [35, 52, 26] is very popular in fine-grained image classification. Many works [51, 50] build object part models with part bounding box annotations. The method in [51] builds two deformable part models [10] to localize objects and discriminative parts. Zhang et al. [50] treat objects and semantic parts equally by assigning them to different object classes with R-CNN [14]. Another line of work [35, 52, 26, 44] estimates part locations in an unsupervised setting. In [35], parts are discovered based on neural activations and then optimized using an EM-like algorithm. The work in [35] extracts the strongest responses in a CNN as the part prior to initialize convolutional filters, and then learns discriminative patch detectors end-to-end. In this paper, our aim is not to build strong part detectors that provide local appearance information for the final classification decision. The goal of our complementary parts model is to effectively exploit the rich information hidden in the object proposals produced during the object detection phase.

Context Encoding with LSTM. LSTM networks have shown a strong capability of encoding contextual information in image classification. In [26], Lam et al. address fine-grained image classification by mining informative image parts using a heuristic network, a successor network and a single-layer LSTM. The heuristic network extracts features from proposals, and the successor network predicts new proposal offsets. The single-layer LSTM fuses the information both for the final object class prediction and for the offset prediction. In [46], attentional regions are discovered recurrently by introducing an LSTM sub-network into multi-label image classification; it sequentially predicts semantic label scores on the located regions and captures spatial dependencies at the same time. The LSTM in our complementary parts model is used to integrate the rich information hidden in different object proposals. Unlike the one-directional LSTMs in [26, 46], we adopt a bi-directional LSTM to learn deep hierarchical representations of all image patches. Experimental results show that this strategy significantly boosts performance in comparison to a single-layer LSTM.

3. Weakly Supervised Complementary Parts Model

3.1. Overview

Given an image 𝑰 and its corresponding image label 𝒚, the method proposed in this paper aims to mine discriminative object parts that capture complementary information via object detection, and then fuse the mined complementary parts for image classification. This reverses the current trend [16, 32, 29] of obtaining object detectors by fine-tuning image classification models. Since we only have image-level labels and no labeled part locations, we formulate the problem in a weakly supervised manner. We adopt an iterative refinement pipeline to improve the estimation of object parts, and then construct a classifier that exploits rich contextual representations focused on object parts to boost classification performance. We decompose the pipeline into three stages, as shown in Figure 2: weakly supervised object detection and instance segmentation, complementary parts model mining, and image classification with context encoding.

3.2. Weakly Supervised Object Detection and Instance Segmentation

Coarse Object Mask Initialization. Given an image 𝑰 with image label 𝒚, the feature maps of the last convolutional layer are denoted by 𝝓(𝑰, 𝜃) ∈ ℝ^{K×H×W}, where 𝜃 stands for the parameters of network 𝝓, K is the number of channels, and H and W are the height and width of the feature maps. Next, global average pooling is applied to 𝝓 to obtain the pooled feature 𝒇_k = Σ_{x,y} 𝝓_k(x, y). A classification layer is added on top, and the class activation map (CAM) for class c can thus be written as

𝑴_c(x, y) = Σ_k w_k^c 𝝓_k(x, y),   (1)

where w_k^c is the weight corresponding to class c in the classification layer that follows global average pooling. The obtained class activation map 𝑴_c is upsampled to the original image size via bilinear interpolation. Since an image may contain multiple object instances, several local maximum responses may be observed on 𝑴_c. We perform multi-region level set segmentation [3] on this map to obtain candidate object instances. Next, for each instance, the class activations are normalized to the range [0, 1]. Suppose there are n object instances in the CAM. We set up an object probability map 𝑭 ∈ ℝ^{(n+1)×H×W} according to the normalized CAM, where the first n probability maps represent the probabilities that a specific object is present in the image, and the (n+1)-st map represents the probability of the background. The background probability map is computed as

𝑭^{n+1} ∈ ℝ^{H×W} = max(1 − Σ_{ι=1}^{n} 𝑭^{ι}, 0).   (2)

Then a conditional random field (CRF) [40] is used to extract higher-quality object instances. In order to apply the CRF, a label map 𝑳 ∈ ℝ^{H×W} is generated according to the following formula,

𝑳(x, y) = { λ, if 𝑭^{λ}(x, y) > σ_c where λ = arg max_{ι} 𝑭^{ι}(x, y); 0, otherwise }   (3)

where σ_c is a fixed threshold, always set to 0.8, used to determine how certain it is that a pixel belongs to an object or to the background. The label map 𝑳 is then fed into the CRF to generate object instance segments, which are treated as pseudo groundtruth annotations for Mask R-CNN training. The parameters of the CRF are the same as in [23]. Stage 1 of Figure 2 shows the whole process of object instance segmentation.

Figure 2. The proposed image classification pipeline based on the weakly supervised complementary parts model. From top to bottom: (a) Weakly Supervised Object Detection and Instance Segmentation: the first step initializes the segmentation probability map with CAM [54] and obtains coarse instance segmentation maps with a CRF [40]; the segments and bounding boxes are then used as groundtruth annotations for training Mask R-CNN [16] in an iterative manner. (b) Complementary Parts Model: search for complementary object proposals to form the object parts model. (c) Image Classification with Context Encoding: two LSTMs [18] are stacked together to fuse and encode the partial information provided by different object parts.

Jointly Detect and Segment Object Instances. Given a set of segmented object instances 𝒮 = [𝒮_1, 𝒮_2, ..., 𝒮_n] of 𝑰 and their corresponding class labels generated in the previous stage, we obtain the minimum bounding box of each segment to form a set of proposals 𝒫 = [𝒫_1, 𝒫_2, ..., 𝒫_n]. The proposals 𝒫, the segments 𝒮 and their corresponding class labels are used to train Mask R-CNN for further proposal and mask refinement. In this way, we turn object detection and instance segmentation into fully supervised learning. We train Mask R-CNN with the same settings as in [16].

CRF-Based Segmentation. Suppose there are m object proposals 𝒫★ = [𝒫★_1, 𝒫★_2, ..., 𝒫★_m], and their corresponding segments 𝒮★ = [𝒮★_1, 𝒮★_2, ..., 𝒮★_m], for image class c whose classification scores are above σ_0, a threshold used to remove outlier proposals. Then, a non-maximum suppression (NMS) procedure with overlap threshold τ is applied to the m proposals.
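The CAM construction of Eqs. (1) and (2) above can be sketched in a few lines of NumPy. The array layouts (channel-first feature maps, a weight row per class) are assumptions matching the description in Sec. 3.2; the upsampling and level set steps are omitted.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Eq. (1): M_c(x, y) = sum_k w_k^c * phi_k(x, y).

    features:   (K, H, W) feature maps phi of the last conv layer
    fc_weights: (num_classes, K) weights of the classification layer
                that follows global average pooling
    """
    cam = np.tensordot(fc_weights[class_idx], features, axes=1)  # (H, W)
    cam = cam - cam.min()          # normalize activations to [0, 1],
    if cam.max() > 0:              # as done per instance in Sec. 3.2
        cam = cam / cam.max()
    return cam

def background_map(instance_maps):
    """Eq. (2): F^{n+1} = max(1 - sum_i F^i, 0) over the n instance maps."""
    return np.maximum(1.0 - np.sum(instance_maps, axis=0), 0.0)
```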
Suppose n object proposals 𝒪 = [𝒪_1, 𝒪_2, ..., 𝒪_n] remain afterwards, where n ≪ m. Most existing research utilizes NMS to suppress a large number of proposals sharing the same class label in order to obtain a small number of distinct object detections. In our weakly supervised setting, however, the proposals suppressed during NMS actually contain rich information about object parts, as shown in Figure 2. Specifically, every proposal 𝒫★_j suppressed by an object proposal 𝒪_i can be regarded as a complementary part of 𝒪_i, and the suppressed proposals can therefore be used to further improve 𝒪_i. We implement this idea by initializing a class probability map 𝑯★ ∈ ℝ^{(C+1)×H×W}. For every proposal 𝒫★_j suppressed by 𝒪_i, we add the probability map of its segmentation mask 𝒮★_j, resized via bilinear interpolation, to the corresponding region of 𝑯★. The class probability maps are then normalized to [0, 1]. The (C+1)-st probability map, for the background, is defined as

𝑯^{★,C+1} ∈ ℝ^{H×W} = max(1 − Σ_{c=1}^{C} 𝑯^{★,c}, 0).   (4)

Given the class probability map 𝑯★, a CRF is applied again to refine and rectify the instance segmentation results, as described in the previous stage.

Iterative Instance Refinement. We alternate CRF-based segmentation and Mask R-CNN based detection and instance segmentation several times to progressively refine the localization and segmentation of object instances. Figure 2 illustrates this iterative instance refinement process.

3.3. Complementary Parts Model

Model Definition. Following the analysis in the previous stage, given a detected object 𝒪_i, its corresponding suppressed proposals 𝒫^{★,i} = [𝒫^{★,i}_1, 𝒫^{★,i}_2, ..., 𝒫^{★,i}_K] potentially contain useful object information and can locate correct object positions. It is therefore necessary to identify the most informative proposals for the subsequent classification task. In this section, we propose a complementary parts model 𝒜 for image classification. The model consists of a root part that covers the entire object and its context, a center part that covers the core region of the object, and a fixed number of surrounding proposals that cover different object parts while still carrying enough discriminative information. A complementary parts model for an object with n parts is defined as an (n+1)-tuple 𝒜 = [𝑨_1, ..., 𝑨_n, 𝑨_{n+1}], where 𝑨_1 is the object center part, 𝑨_{n+1} is the root part, and 𝑨_i is the i-th part. Each part is defined by a tuple 𝑨_i = [𝒇_i, 𝒈_i], where 𝒇_i is the feature of the i-th part and 𝒈_i is a four-dimensional tuple describing the part geometry, namely the part center and the part size (x_i, y_i, w_i, h_i). A parts model without any missing part is called an object hypothesis. For the object parts to complement each other, the differences in their appearance features or locations should be as large as possible, while the sum of their part scores should also be as large as possible. These criteria serve as constraints when searching for discriminative parts that complement each other. The score Δ(𝒜) of an object hypothesis is the sum of the scores of all object parts minus the appearance similarity and spatial overlap between different parts:

Δ(𝒜) = Σ_{i=1}^{n+1} φ(𝑨_i) − α_0 Σ_{i=1}^{n+1} Σ_{j=i+1}^{n+1} [Δ_a(𝑨_i, 𝑨_j) + β_0 Δ_s(𝑨_i, 𝑨_j)],   (5)

where φ(𝑨_i) is the score of the i-th part from the classification branch of Mask R-CNN, Δ_a(𝑨_i, 𝑨_j) = ∥𝒇_i − 𝒇_j∥_2 is the semantic similarity, Δ_s(𝑨_i, 𝑨_j) is the spatial overlap between parts i and j, and α_0 = 0.01 and β_0 = 0.1 are two constant parameters. Given a set of object hypotheses, we choose the hypothesis achieving the maximum score as the final object part model. Searching for the optimal subset of proposals maximizing the above score is a combinatorial optimization problem, which is computationally expensive. In the following, we seek an approximate solution using a fast heuristic algorithm.

Part Location Initialization. To initialize a parts model, we simplify part estimation by designing a grid-based object parts template that follows two basic rules. First, every part should contain enough discriminative information; second, the differences between part pairs should be as large as possible. As shown in Figure 2, deep convolutional neural networks have demonstrated their ability to localize the most discriminative parts of an object. Thus, we set the root part 𝑨_{n+1} to be the object proposal 𝒪_i that represents the entire object. Then, an s × s (= n) grid of size W_{n+1} × H_{n+1} centered at 𝑨_{n+1} is constructed, where W_{n+1} and H_{n+1} are the width and height of the root part 𝑨_{n+1}. The center grid cell is assigned to the object center part, and the remaining grid cells are assigned to the parts 𝑨_k with k ∈ [2, 3, ..., n]. We then initialize every part 𝑨_k ∈ 𝒜 to the suppressed proposal 𝒫★_j ∈ 𝒫★ closest to its assigned grid cell.

Part Model Searching. For a model with n object parts (excluding the (n+1)-st root part) and K candidate suppressed proposals, the objective function is defined as

𝒜̂ = arg max_{𝒜 ∈ 𝒮_𝒜} Δ(𝒜),   (6)

where 𝒮_𝒜 = {𝒜_1, 𝒜_2, ..., 𝒜_K} is the set of object hypotheses. As mentioned earlier, directly searching for an optimal parts model can be intractable. Thus, we adopt a greedy search strategy to search for 𝒜̂. Specifically, we sequentially go through every 𝑨_i in 𝒜 and find the optimal object part for 𝑨_i in 𝒫★ that maximizes Δ(𝒜). The overall time complexity is reduced from exponential to linear, O(nK). In Figure 2, we can see that the object hypotheses generated during the search process cover different parts of the object and do not focus on the core region only.

3.4. Image Classification with Context Encoding

CNN Feature Extractor Fine-tuning. Given an input image 𝑰 and the parts model 𝒜 = [𝑨_1, ..., 𝑨_n, 𝑨_{n+1}] constructed in the previous stage, the image patches corresponding to the parts are denoted as 𝑰(𝒜) = [𝑰(𝑨_1), 𝑰(𝑨_2), ..., 𝑰(𝑨_n), 𝑰(𝑨_{n+1})]. During image classification, random crops of images are often used to train the model. Thus, apart from the (n+1) patches, we append a random crop of the original image as the (n+2)-nd image patch. The motivation for adding a randomly cropped patch is to include more context information during training, since the patches corresponding to object parts primarily focus on the object itself. Every patch shares the same label with the original image it is cropped from. All patches …

Figure 3. Context encoded image classification based on LSTMs. Two standard LSTMs [18] are stacked together. They have opposite scan directions.
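The hypothesis scoring of Eq. (5) and the greedy approximation of Eq. (6) can be sketched as follows. The part representation (a dict holding the Mask R-CNN score φ, the feature 𝒇 and a corner-format box) and the use of intersection-over-union as the overlap term Δ_s are illustrative assumptions; the paper does not pin down Δ_s precisely.

```python
import numpy as np
from itertools import combinations

ALPHA_0, BETA_0 = 0.01, 0.1  # the two constants from Eq. (5)

def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes,
    used here as the spatial overlap term Delta_s (an assumption)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def hypothesis_score(parts):
    """Eq. (5): part scores minus pairwise appearance/overlap penalties."""
    total = sum(p['score'] for p in parts)
    penalty = 0.0
    for pi, pj in combinations(parts, 2):
        delta_a = float(np.linalg.norm(pi['feat'] - pj['feat']))  # ||f_i - f_j||_2
        penalty += delta_a + BETA_0 * iou(pi['box'], pj['box'])
    return total - ALPHA_0 * penalty

def greedy_search(init_parts, candidates):
    """Greedy approximation of Eq. (6): sweep the parts once, replacing each
    part with the unused candidate proposal that maximizes Eq. (5)."""
    parts = list(init_parts)
    for i in range(len(parts)):
        avail = [c for c in candidates if not any(c is p for p in parts)]
        if not avail:
            continue
        best = max(avail, key=lambda c: hypothesis_score(parts[:i] + [c] + parts[i + 1:]))
        if hypothesis_score(parts[:i] + [best] + parts[i + 1:]) > hypothesis_score(parts):
            parts[i] = best
    return parts
```

A single sweep evaluates O(nK) hypotheses, matching the linear complexity claimed in Sec. 3.3.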