FINE-GRAINED VISUAL CATEGORIZATION WITH FINE-TUNED SEGMENTATION
Lingyun Li¹, Yanqing Guo¹, Lingxi Xie², Xiangwei Kong¹, Qi Tian³
¹ Dalian University of Technology, Dalian, Liaoning 116024, China
² Dept. of Computer Science and Technology, Tsinghua University, Beijing 100084, China
³ Dept. of Computer Science, University of Texas at San Antonio, TX 78249, USA
ABSTRACT
Fine-grained visual categorization (FGVC) refers to the task
of classifying objects that belong to the same basic-level class
(e.g., different bird species). Since the subtle inter-class varia-
tion often exists on small parts (e.g., beak, belly, etc.), it is rea-
sonable to localize semantic parts of an object before describ-
ing it. However, unsupervised part-segmentation methods of-
ten suffer from over-segmentation which harms the quality of
image representation. In this paper, we present a fine-tuning
approach to tackle this problem. Specifically, we use a
greedy algorithm to optimize an intuitive objective function,
preserving principal parts while filtering out noise, and further
construct mid-level parts on top of the refined parts toward
a more descriptive representation. Experiments demonstrate
that our approach achieves competitive classification accura-
cy on the CUB-200-2011 dataset with both Fisher vectors and
deep conv-net features.
Index Terms— Fine-Grained Visual Categorization,
Part-based Model, Object Segmentation, Refinement.
1. INTRODUCTION
Fine-grained visual categorization (FGVC) refers to the task
of distinguishing subordinate categories (e.g., tree sparrow,
ivory gull, Anna's hummingbird) that belong to the same
basic-level category (bird). The subtle inter-class variation is
often the major challenge of FGVC.
The Bag-of-Features (BoF) model is widely adopted for
image classification. It extracts local descriptors, encodes and
summarizes them into a global image representation. Some-
times, spatial context modeling is adopted to group descrip-
tors according to their coordinates on the image. To introduce
more visual clues based on parts, unsupervised part detectors
have been proposed for fine-grained tasks. Template matching
models are adopted to automatically discover object parts [1] [2],
and the Deformable Part Model (DPM) has proven effective for
part alignment [3] [4]. Researchers also suggest partitioning
the segmented foreground into parts in both supervised [5]
and unsupervised [6] manners. However, unsupervised part
detectors [4] [6] often suffer from over-segmentation, which
leads to ambiguous image representation and, consequently,
unsatisfactory classification accuracy. An over-segmented
example is shown in the upper-right part of Figure 1.
Fig. 1: Sample images from the CUB-200-2011 dataset [7]
(best viewed in color). Each image is cropped with the
provided bounding box. Top: Examples of fine-grained
alignment [6]. Bottom: Examples of symbiotic segmentation and
part localization [4].
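The Bag-of-Features pipeline described above (extract local descriptors, encode them, and summarize them into a global representation, optionally grouped by image coordinates) can be sketched as follows. This is a minimal illustration only, not the implementation used in this paper: it assumes that local descriptors and their (x, y) positions are already available as NumPy arrays, it uses plain k-means with hard assignment rather than Fisher vectors, and the function names `build_codebook`, `encode_bof` and `encode_spatial` are hypothetical.

```python
import numpy as np


def build_codebook(descriptors, k, iters=10, seed=0):
    """Learn k codewords with plain k-means (Lloyd's algorithm)."""
    descriptors = np.asarray(descriptors, dtype=np.float64)
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct, randomly chosen descriptors.
    idx = rng.choice(len(descriptors), size=k, replace=False)
    centers = descriptors[idx].copy()
    for _ in range(iters):
        # Squared Euclidean distance of every descriptor to every center.
        d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers


def encode_bof(descriptors, centers):
    """Hard-assign descriptors to codewords and pool into an L1-normalized histogram."""
    descriptors = np.asarray(descriptors, dtype=np.float64)
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(centers)).astype(np.float64)
    return hist / max(hist.sum(), 1.0)


def encode_spatial(descriptors, xy, centers, width, height, grid=2):
    """Spatial context: one histogram per grid cell, concatenated."""
    xy = np.asarray(xy, dtype=np.float64)
    col = np.minimum((xy[:, 0] / width * grid).astype(int), grid - 1)
    row = np.minimum((xy[:, 1] / height * grid).astype(int), grid - 1)
    cell = row * grid + col
    parts = []
    for c in range(grid * grid):
        mask = cell == c
        if mask.any():
            parts.append(encode_bof(descriptors[mask], centers))
        else:
            parts.append(np.zeros(len(centers)))  # empty cell
    return np.concatenate(parts)
```

Grouping descriptors by a fixed grid is only one simple way to model spatial context; part-based methods replace the grid cells with detected or segmented object parts.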
In this paper, we propose a simple fine-tuning algorithm
to combat over-segmentation. Based on a straightforward in-
tuition, we formulate the fine-tuning process with an objective
function, and optimize it using a greedy algorithm. We further
construct mid-level visual concepts on the basis of the refined
parts with a brute-force search. Our experiments verify that,
although the number of parts decreases during merging and
combination, higher classification accuracy is achieved, implying that
a more discriminative image representation is obtained. The
main contribution of this paper is to provide evidence of
the benefit of fine-tuned segmentation for fine-grained visual
categorization. We evaluate our algorithm with a bird classi-
fication task on the CUB-200-2011 dataset [7], and demon-
strate competitive performance, i.e., 65.13% with Fisher vec-
tors and 70.34% with deep conv-net features.
2. RELATED WORK
Fine-grained visual categorization (FGVC) aims to discriminate
images that belong to the same basic-level concept, such as
flowers [8], aircraft [9], dogs [10] and birds [7]. It is closely
related to two well-studied topics in computer vision, i.e., image
representation and object part detection.
978-1-4799-8339-1/15/$31.00 ©2015 IEEE