Look Closer to See Better: Recurrent Attention Convolutional Neural Network
for Fine-grained Image Recognition
Jianlong Fu¹, Heliang Zheng², Tao Mei¹
¹Microsoft Research, Beijing, China
²University of Science and Technology of China, Hefei, China
¹{jianf, tmei}@microsoft.com, ²zhenghl@mail.ustc.edu.cn
Abstract
Recognizing fine-grained categories (e.g., bird species)
is difficult due to the challenges of discriminative region
localization and fine-grained feature learning. Existing
approaches predominantly solve these challenges indepen-
dently, while neglecting the fact that region detection and
fine-grained feature learning are mutually correlated and
thus can reinforce each other. In this paper, we propose
a novel recurrent attention convolutional neural network
(RA-CNN) which recursively learns discriminative region
attention and region-based feature representation at multiple
scales in a mutually reinforced way. The learning at
each scale consists of a classification sub-network and an
attention proposal sub-network (APN). The APN starts from
full images, and iteratively generates region attention from
coarse to fine by taking previous predictions as a reference,
while a finer scale network takes as input an amplified at-
tended region from previous scales in a recurrent way. The
proposed RA-CNN is optimized by an intra-scale classification
loss and an inter-scale ranking loss, to mutually learn
accurate region attention and fine-grained representation.
RA-CNN does not need bounding box/part annotations and
can be trained end-to-end. We conduct comprehensive
experiments and show that RA-CNN achieves the best performance
on three fine-grained tasks, with relative accuracy
gains of 3.3%, 3.7%, and 3.8% on CUB Birds, Stanford Dogs,
and Stanford Cars, respectively.
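As a minimal sketch of how the two objectives named above could be combined, the following assumes a cross-entropy classification loss at each scale and a hinge-style pairwise ranking loss between adjacent scales. The abstract names only the two loss types; the hinge form, the `margin` value, the equal weighting of the terms, and all function names here are illustrative assumptions, not the authors' stated formulation.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_loss(probs, label):
    """Intra-scale loss: cross-entropy on the true class (assumed form)."""
    return -math.log(probs[label])


def ranking_loss(p_coarse, p_fine, margin=0.05):
    """Inter-scale loss (assumed hinge form): penalize the finer scale
    unless its true-class probability beats the coarser scale by `margin`."""
    return max(0.0, p_coarse - p_fine + margin)


def multi_scale_loss(logits_per_scale, label, margin=0.05):
    """Sum the classification loss over all scales and the ranking loss
    over each pair of adjacent scales (coarse -> fine)."""
    probs = [softmax(logits) for logits in logits_per_scale]
    cls_term = sum(classification_loss(p, label) for p in probs)
    rank_term = sum(
        ranking_loss(probs[s][label], probs[s + 1][label], margin)
        for s in range(len(probs) - 1)
    )
    return cls_term + rank_term


if __name__ == "__main__":
    # Three hypothetical scales, each producing scores for 3 classes.
    scores = [[1.0, 0.5, 0.2], [1.5, 0.4, 0.1], [2.5, 0.2, 0.1]]
    print(multi_scale_loss(scores, label=0))
```

Under this assumed form, the ranking term is zero only when each finer scale is more confident in the true class than the scale before it, which is one way to encourage the attended region to be more discriminative than its parent.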
1. Introduction
Recognizing fine-grained categories by computer vision
techniques (e.g., classifying bird species [2, 34], flower
types [21, 24], car models [14, 19], etc.) has attracted
extensive attention. The task is very challenging, as some
fine-grained categories (e.g., “eared grebe” and “horned
grebe”) can only be recognized by domain experts. Unlike
general recognition, fine-grained image recognition should
be capable of localizing and representing the very marginal
visual differences within subordinate categories, and thus
can benefit a wide variety of applications, e.g., expert-level
image recognition [15, 31], rich image captioning [1, 12],
and so on.
Figure 1. Two bird species of woodpecker. We can observe the
very subtle visual differences from highly local regions (e.g.,
heads in yellow boxes), which are difficult to learn from the
original image scale. However, the difference can be more vivid
and significant if we can learn to zoom into the attended regions
at a finer scale. [Best viewed in color]
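The zoom-in idea that Figure 1 motivates can be illustrated with a small sketch: crop a square region around an attended point and enlarge it so that local detail occupies the full input resolution. This is only an illustration under stated assumptions. The box here is supplied by hand rather than learned, nearest-neighbor upsampling is chosen for simplicity, and the function name and parameters are hypothetical; this section does not specify how the network crops or amplifies regions.

```python
def crop_and_zoom(image, cx, cy, half, out_size):
    """Crop a square of half-width `half` centered at (cx, cy) from `image`
    (a list of pixel rows) and upsample it to out_size x out_size using
    nearest-neighbor sampling. The box is clamped to the image bounds."""
    height, width = len(image), len(image[0])
    top, bottom = max(0, cy - half), min(height, cy + half)
    left, right = max(0, cx - half), min(width, cx + half)
    crop_h, crop_w = bottom - top, right - left
    if crop_h <= 0 or crop_w <= 0:
        raise ValueError("attended region lies outside the image")

    zoomed = []
    for i in range(out_size):
        # Map each output row/column back to its nearest source pixel.
        src_y = top + (i * crop_h) // out_size
        row = [image[src_y][left + (j * crop_w) // out_size]
               for j in range(out_size)]
        zoomed.append(row)
    return zoomed


if __name__ == "__main__":
    # A toy 4x4 "image" whose pixel values encode their position.
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    for row in crop_and_zoom(image, cx=2, cy=2, half=1, out_size=4):
        print(row)
```

In this toy run the central 2x2 patch is enlarged to 4x4, so each source pixel covers a 2x2 block of the output. A learned attention model would predict `cx`, `cy`, and `half` instead of taking them as arguments.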
The challenges of fine-grained recognition are mainly
two-fold: discriminative region localization and fine-grained
feature learning from those regions. Previous research
has made impressive progress by introducing part-based
recognition frameworks, which typically consist of
two steps: 1) identifying possible object regions by analyzing
convolutional responses from neural networks in
an unsupervised fashion or by using supervised bounding
box/part annotations, and 2) extracting discriminative features
from each region and encoding them into compact
vectors for recognition. Although promising results have
been reported, further improvement suffers from the following
limitations. First, human-defined regions or the regions
learned by existing unsupervised methods may not
be optimal for machine classification [35]. Second, subtle
visual differences existing in local regions from similar fine-