SPECIAL SECTION ON INNOVATION AND APPLICATION OF INTELLIGENT PROCESSING OF
DATA, INFORMATION AND KNOWLEDGE AS RESOURCES IN EDGE COMPUTING
Received August 5, 2019, accepted August 13, 2019, date of publication August 19, 2019, date of current version September 5, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2936118
Fine-Grained Classification via Hierarchical
Bilinear Pooling With Aggregated Slack Mask
MIN TAN
1
, (Member, IEEE), GUIJUN WANG
1
, JIAN ZHOU
1
,
ZHIYOU PENG
2
, AND MEILIAN ZHENG
3,4
1
Key Laboratory of Complex Systems Modeling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University,
Hangzhou 310018, China
2
Department of Pain Medicine, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou 310003, China
3
School of Management, Zhejiang University of Technology, Hangzhou 310023, China
4
Zhejiang Hithink RoyalFlush Artificial Intelligence Research Institute, Hangzhou 310023, China
Corresponding author: Meilian Zheng (zmldlk@zjut.edu.cn)
This work was supported in part by the Zhejiang Provincial Natural Science Foundation of China under Grant LY19F020038, in part by the
National Natural Science Foundation of China under Grant 61602136, Grant 81603198, and Grant 61622205, and in part by the Zhejiang
Provincial Key Science and Technology Project Foundation under Grant 2018C01012.
ABSTRACT Extracting discriminative fine-grained features is essential for fine-grained image recognition
tasks. Many researchers utilize expensive human annotations to learn discriminative part models, which may
be impossible for real-world applications. Recently, bilinear pooling has been frequently adopted and has
shown its effectiveness owing to its learning discriminative regions automatically. However, most bilinear
pooling models still utilize the all convolutional part/region features for recognition, including those noisy
or even harmful feature elements. In this paper, we devise a novel fine-grained image classification approach
by the Hierarchical Bilinear Pooling with Aggregated Slack Mask (HBPASM) model. The proposed model
generates a RoI-aware image feature representation for better performance. We conduct experiments on
three frequently used fine-grained image classification datasets. The experimental results demonstrate that
HBPASM achieves competitive performance or even match the state-of-the-art methods on CUB-200-2011,
Stanford Cars, and FGVC-Aircraft, respectively.
INDEX TERMS Fine-grained classification, image mask, multi-scale, RoI feature, deep learning.
I. INTRODUCTION
Owing to the development of deep learning, many efforts
have been made in many computer vision tasks. Though with
much progress, there are still many challenges in fine-grained
classification tasks. Unlike traditional image classification
tasks, fine-grained classification aims at identifying subcate-
gories with subtle visual differences. These visual differences
can be easily confused by complex image background in
images. Therefore, it is necessary to reduce the impact of
background information and extract discriminative RoI fea-
tures for fine-grained classification tasks.
Fine-grained image classification aims at distinguish-
ing different subordinate classes with subtle visual
differences [1]–[3]. It serves as a core problem in many
multimedia applications [4], e.g., image understanding,
The associate editor coordinating the review of this article and approving
it for publication was Ying Li.
cross-modal retrieval, etc. Though many efforts have been
made to improve the performance [5], high visual similarities
among different categories still challenge this task [6], espe-
cially when images have cluttered background. To deal with
the subtle visual differences, researchers focus on localizing
distinctive regions or extracting discriminative features for
improved performance.
Many efforts have been made to design part-based models
to localize object parts as the distinctive regions [7]–[12].
These models are obtained by analyzing the convolu-
tional activations from neural network in an unsupervised
manner or discriminatively training part detectors with
supervised bounding-box/part annotations. Among these
models, bilinear convolutional neural network (CNN) model,
i.e., BCNN, [13] and its variants [14], [15] have achieved sat-
isfactory results on many fine-grained image datasets. They
helps learn distinctive regions without utilizing expensive
part annotations, and the distinctive regions are discovered by
117944
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
VOLUME 7, 2019