DEEP IMAGE RETRIEVAL: A SURVEY 5
[Figure 5 diagram. Panels: MAC, R-MAC, GeM, SPoC, CroW, and CAM+CroW. Each panel pools H×W×C feature maps channel-wise into a C×1 descriptor; R-MAC max-pools each of K regions, giving C×K.]
Fig. 5: Representative methods in single feedforward
frameworks, focusing on convolutional feature maps: MAC
[48], R-MAC [28], GeM pooling [42], SPoC with the Gaussian
weighting scheme [7], CroW [10], and CAM+CroW [29]. Note
that g_1(·) and g_2(·) represent spatial-wise and channel-wise
weighting functions, respectively.
is more important than final classification probabilities. This
section will survey the strategies which have been developed
to improve the quality of feature representations, particularly
based on feature extraction / fusion (Section 3.1) and feature
enhancement (Section 3.2).
3.1 Deep Feature Extraction
3.1.1 Network Feedforward Scheme
a. Single Feedforward Pass Methods.
Single feedforward pass methods take the whole image and
feed it into an off-the-shelf model to extract features. The
approach is relatively efficient since the input image is fed
through the network only once. For these methods, both the
fully-connected layer and the last convolutional layer can be
used as feature extractors [70]. The fully-connected layer has a
global receptive field, so it is able to produce more
semantic-aware features [13]. After normalization and
dimensionality reduction, these features can be used directly
for similarity measurement without further processing, and
they admit efficient search strategies [25, 26, 34].
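The normalize, reduce, and compare steps described above can be sketched as follows. This is a minimal illustration and not the pipeline of any cited work: the random matrix stands in for fully-connected activations, PCA is assumed as the dimensionality reduction, cosine similarity is assumed as the similarity measure, and all function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to (approximately) unit Euclidean length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def fit_pca(features, dim):
    """Return the mean and the top-`dim` principal directions of `features`."""
    mean = features.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:dim]

def encode(features, mean, components):
    """Normalize, project onto the PCA basis, then normalize again."""
    projected = (l2_normalize(features) - mean) @ components.T
    return l2_normalize(projected)

def rank(query_code, database_codes):
    """Return database indices sorted by decreasing cosine similarity."""
    scores = database_codes @ query_code
    return np.argsort(-scores)

rng = np.random.default_rng(0)
fc_features = rng.normal(size=(100, 64))  # stand-in for FC-layer activations
mean, components = fit_pca(l2_normalize(fc_features), dim=16)
codes = encode(fc_features, mean, components)
ranking = rank(codes[3], codes)  # query with database item 3
```

Because the codes are unit-length, the dot product equals cosine similarity, so the query's own entry ranks first. In a large-scale setting the exhaustive dot product would typically be replaced by an approximate nearest-neighbour index.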
Fig. 6: Image patch generation schemes: (a) Rigid grid; (b)
Spatial pyramid modeling (SPM), which splits an image into
regions at different scales and positions (blue, green and red
boxes); (c) Dense patch sampling, where a fixed-size sliding
window samples the image; (d) Region proposals (RP), in which
the specific object or instance is extracted as region proposals.
Using the fully-connected layer may yield insufficient
performance since it lacks geometric invariance and spatial
information, so the last convolutional layer can be examined
instead. Research on convolutional features focuses on
improving their discrimination; representative strategies are
shown in Figure 5. One direction is to treat regions in feature
maps as different sub-vectors, so that combinations of
sub-vectors from all feature maps are used to represent the
input image. For instance, Gordo et al. [38] apply regional
maximum activation of convolutions (R-MAC) [28] to obtain
relevant regions on each feature map, which filters out some
irrelevant (background) information and is beneficial for
extracting instance-relevant features. Inspired by R-MAC, Li
et al. [59] propose a non-linear feature embedding method for
visual object retrieval and achieve remarkable performance
improvements compared to the state of the art.
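To make the pooling strategies named in Figure 5 concrete, the sketch below applies simplified versions of MAC, SPoC, GeM, and region-wise max pooling to a feature map stored as an H×W×C array. It is an illustration under assumptions: the Gaussian width `sigma`, the GeM exponent `p`, and the caller-supplied region list are hypothetical choices, and the region function omits the multi-scale grid and post-processing of the full R-MAC method.

```python
import numpy as np

def mac(fmap):
    """MAC: channel-wise max over all H x W positions, giving shape (C,)."""
    return fmap.max(axis=(0, 1))

def spoc(fmap, sigma=None):
    """SPoC-style sum pooling with a Gaussian centre prior, giving shape (C,)."""
    h, w, _ = fmap.shape
    if sigma is None:
        sigma = min(h, w) / 3.0  # assumed default, not taken from the text
    ii, jj = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Weights peak at the centre of the map and decay towards the borders.
    dist2 = (ii - (h - 1) / 2) ** 2 + (jj - (w - 1) / 2) ** 2
    weights = np.exp(-dist2 / (2 * sigma ** 2))
    return (fmap * weights[:, :, None]).sum(axis=(0, 1))

def gem(fmap, p=3.0, eps=1e-6):
    """GeM: generalized mean; p=1 is average pooling, large p approaches max."""
    clipped = np.clip(fmap, eps, None)
    return (clipped ** p).mean(axis=(0, 1)) ** (1.0 / p)

def regional_max(fmap, regions):
    """Max-pool each (top, left, height, width) region, giving shape (C, K)."""
    pooled = [fmap[t:t + h, l:l + w].max(axis=(0, 1)) for t, l, h, w in regions]
    return np.stack(pooled, axis=1)

fmap = np.random.default_rng(0).random((7, 7, 8))  # stand-in conv activations
descriptors = {
    "MAC": mac(fmap),
    "SPoC": spoc(fmap),
    "GeM": gem(fmap),
    "regions": regional_max(fmap, [(0, 0, 4, 4), (3, 3, 4, 4)]),
}
```

The first three functions each return one C-dimensional vector, while the region variant returns one column per region, matching the C×1 and C×K shapes shown in the figure.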
b. Multiple Feedforward Pass Methods.
Compared to single-pass schemes, multiple-pass methods
are more time-consuming [8] because several patches are
generated from an input image and all of them are fed into the
network before being encoded as a final global feature.
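The cost of multiple passes can be seen in a small sketch: one extractor call per patch, followed by an aggregation step. The code assumes sliding-window patches and sum aggregation with L2 normalization, neither of which is prescribed by the text, and `toy_extract` is a hypothetical stand-in for a network forward pass.

```python
import numpy as np

def sliding_windows(height, width, size, stride):
    """Yield (top, left) corners of size x size windows that fit the image."""
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            yield top, left

def multi_pass_descriptor(image, extract, size, stride):
    """Call `extract` once per patch, then sum and L2-normalize the results."""
    corners = sliding_windows(image.shape[0], image.shape[1], size, stride)
    feats = [extract(image[t:t + size, l:l + size]) for t, l in corners]
    pooled = np.sum(feats, axis=0)
    return pooled / (np.linalg.norm(pooled) + 1e-12), len(feats)

def toy_extract(patch):
    """Hypothetical stand-in for a forward pass: mean value per channel."""
    return patch.mean(axis=(0, 1))

image = np.random.default_rng(0).random((64, 96, 3))
descriptor, n_passes = multi_pass_descriptor(image, toy_extract, size=32, stride=16)
```

For this 64×96 image, a 32-pixel window with stride 16 produces 3 × 5 = 15 patches, so the extractor runs 15 times where a single-pass method would run once.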
Multiple-pass strategies can lead to higher retrieval
accuracy since representations are produced in two stages:
patch detection and patch description. Multi-scale image
patches are obtained using sliding windows [26, 71], random
cropping [25, 57], or spatial pyramid modeling (SPM) [32], as
illustrated in Figure 6. For example, Xu et al. [72] randomly
sample windows within an image at different scales and
positions, and then calculate “edgeness” scores to represent the
edge density within the windows.
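As a rough illustration of scoring randomly sampled windows by edge density, the sketch below defines "edgeness" as the fraction of edge pixels inside a window. That definition, the gradient-threshold edge detector, and all names and parameters are assumptions made for this example; the exact formulation of Xu et al. [72] may differ.

```python
import numpy as np

def edge_map(gray, threshold=0.1):
    """Binary edge map from finite-difference gradient magnitude (assumed detector)."""
    gy, gx = np.gradient(gray.astype(float))
    return np.hypot(gx, gy) > threshold

def random_windows(height, width, n, min_size, rng):
    """Sample n (top, left, h, w) windows at random scales and positions."""
    windows = []
    for _ in range(n):
        h = int(rng.integers(min_size, height + 1))
        w = int(rng.integers(min_size, width + 1))
        top = int(rng.integers(0, height - h + 1))
        left = int(rng.integers(0, width - w + 1))
        windows.append((top, left, h, w))
    return windows

def edgeness(edges, window):
    """Fraction of edge pixels inside the window (assumed definition)."""
    top, left, h, w = window
    return float(edges[top:top + h, left:left + w].mean())

rng = np.random.default_rng(0)
gray = np.zeros((32, 32))
gray[8:24, 8:24] = 1.0  # a bright square on a dark background
edges = edge_map(gray)
windows = random_windows(32, 32, n=20, min_size=8, rng=rng)
scores = [edgeness(edges, win) for win in windows]
```

Windows that overlap the border of the square receive higher scores than windows over flat regions, which is the kind of signal that could be used to keep informative patches and discard uniform ones.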
These patch detection methods lack retrieval efficiency
for large-scale datasets since irrelevant patches are also fed
into deep networks; it is therefore necessary to analyze
image patches [28]. As an example, Cao et al. [73] propose to
merge image patches into larger regions with different
hyper-parameters, and then treat hyper-parameter selection as
an optimization problem whose objective is to maximize the
similarity between features of the query and the candidates.