Weakly Supervised Learning of Part Selection Model with
Spatial Constraints for Fine-Grained Image Classification
Xiangteng He, Yuxin Peng∗
Institute of Computer Science and Technology, Peking University
Beijing 100871, China
pengyuxin@pku.edu.cn
Abstract
Fine-grained image classification, which aims to recognize hundreds of
sub-categories belonging to the same basic-level category, is challenging
due to the large intra-class variance and small inter-class variance.
Since two different sub-categories are distinguished only by the subtle
differences in some specific parts, semantic part localization is crucial for
fine-grained image classification. Most previous works improve
the accuracy by localizing the semantic parts, but rely
heavily on object or part annotations of images,
whose labeling is costly. Recently, some researchers
have begun to recognize sub-categories via weakly supervised
part detection instead of using the expensive annotations.
However, these works ignore the spatial relationship
between the object and its parts as well as the interaction among
the parts, both of which are helpful for part selection.
Therefore, this paper proposes a weakly supervised part selection
method with spatial constraints for fine-grained image
classification, which is free of any bounding box or
part annotations. We first learn a whole-object detector automatically
to localize the object by jointly using saliency
extraction and co-segmentation. Then two spatial constraints
are proposed to select the discriminative parts. The first spatial
constraint, called the box constraint, defines the relationship
between the object and its parts, ensuring that the
selected parts lie inside the object region and
have the largest overlap with it. The second
spatial constraint, called the parts constraint, defines the relationship
among the object's parts, and reduces the parts' overlap with
each other to avoid information redundancy and ensure
that the selected parts are the most discriminative ones compared with
other sub-categories. Combining the two spatial constraints promotes part
selection significantly and achieves a notable improvement
in fine-grained image classification. Experimental results
on the CUB-200-2011 dataset demonstrate the superiority
of our method even compared with methods that use
expensive annotations.
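To make the two constraints concrete, the following sketch is our own illustration, not the paper's implementation: the box representation, the greedy selection scheme, and the trade-off weight `alpha` are all assumptions. It scores each candidate part box by its overlap with the detected object box (box constraint) and penalizes overlap with already-selected parts (parts constraint):

```python
# Hypothetical sketch of the two spatial constraints; boxes are (x1, y1, x2, y2).
def area(box):
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])

def intersection(a, b):
    # Area of the overlapping region of two boxes (0 if disjoint).
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))

def box_constraint(part, obj):
    # Fraction of the part lying inside the object region;
    # 1.0 means the part is entirely within the object box.
    return intersection(part, obj) / area(part)

def parts_constraint(part, selected):
    # Redundancy penalty: total overlap with already-selected parts.
    return sum(intersection(part, p) / area(part) for p in selected)

def select_parts(candidates, obj, k=2, alpha=1.0):
    # Greedily pick k parts: high overlap with the object,
    # low overlap with each other (alpha trades off the two terms).
    selected = []
    for _ in range(k):
        best = max(
            (c for c in candidates if c not in selected),
            key=lambda c: box_constraint(c, obj)
                          - alpha * parts_constraint(c, selected),
        )
        selected.append(best)
    return selected
```

Under this sketch, a candidate outside the object box scores 0 on the box constraint, and two heavily overlapping candidates cannot both be selected, which mirrors the redundancy-avoidance goal described above.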
Introduction
Fine-grained image classification is an extremely challenging
task, which aims to distinguish objects in subordinate
classes, such as bird types (Wah et al. 2011), dog species
(Khosla et al. 2011), plant breeds (Angelova and Zhu 2013)
∗Corresponding author.
Copyright © 2017, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
and aircraft models (Maji et al. 2013), etc. An inexperienced
person can easily recognize basic-level categories such as
birds, horses and dogs, since they vary a lot in appearance.
Such a person may know several kinds of birds, but would find it
very difficult to recognize 200 or even more sub-categories. For
example, it is extremely hard for an inexperienced person
to distinguish between Herring Gull and Slaty-backed Gull
whose appearances are very similar, as both have a
gray back and pink legs. These subordinate classes share the
same global appearance, and are often distinguished by the
subtle differences in their parts (e.g. Herring Gull and Slaty-backed
Gull are distinguished by the color of the back, the
latter's being darker). Therefore, the object and its salient parts
are crucial for fine-grained image classification.
Since the discriminative features are mainly localized on
the object and its parts, most existing works follow the
pipeline: first localizing the object or its parts, and then
extracting discriminative features for fine-grained image
classification. As the fine-grained image classification datasets
(e.g. CUB-200-2011 (Wah et al. 2011)) mostly have the
detailed annotations like bounding box and part locations,
early works directly use the detailed annotations at both the
training and testing stages. The works of (Chai, Lempitsky,
and Zisserman 2013; Yang et al. 2012) use the provided
bounding box to learn part detectors in an unsupervised or
latent manner. Several methods even use the part annota-
tions (Berg and Belhumeur 2013; Xie et al. 2013). Since
the annotations of testing images are not available in
practical applications, some works use the object or part
annotations only at the training stage and no annotations
at the testing stage. Bounding box and part annotations
are directly used in the training phase to learn a strongly
supervised deformable part-based model (Zhang et al. 2013;
Azizpour and Laptev 2012), or directly used to fine-tune a
pre-trained Convolutional Neural Network (CNN) (Branson et al.
2014). Furthermore, Krause et al. (2015) use the bounding
box only at the training stage to learn the part detectors,
then localize the parts automatically at the testing
stage. Recently, some promising works attempt
to learn the part detectors under the weakly supervised condition,
i.e., the bounding box and part annotations are not
used at either the training or testing stage. These works make it
possible to put fine-grained image classification into practical
applications. Neural Activation Constellations Part Model
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)