Learning global representations for image search 5
tion in an end-to-end manner. To that end, we leverage a three-stream Siamese
network with a triplet ranking loss. We also describe how to learn the pooling
mechanism using a region proposal network (RPN) instead of relying on a rigid
grid (Section 3.2). Finally, we detail the overall descriptor extraction process for
a given image (Section 3.3).
3.1 Learning to retrieve particular objects
R-MAC revisited. Recently, Tolias et al. [14] presented R-MAC, a global im-
age representation particularly well-suited for image retrieval. The R-MAC ex-
traction process is summarized in any of the three streams of the network in
Fig. 1 (top). In a nutshell, the convolutional layers of a pre-trained network
(e.g. VGG16 [46]) are used to extract activation features from the images, which
can be understood as local features that do not depend on the image size or
its aspect ratio. Local features are max-pooled in different regions of the image
using a multi-scale rigid grid with overlapping cells. These pooled region features
are independently $\ell_2$-normalized, whitened with PCA, and $\ell_2$-normalized again.
Unlike spatial pyramids, instead of concatenating the region descriptors, they
are sum-aggregated and $\ell_2$-normalized, producing a compact vector whose size
(typically 256-512 dimensions) is independent of the number of regions in the
image. Comparing two image vectors with dot-product can then be interpreted
as an approximate many-to-many region matching.
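The aggregation pipeline above can be sketched in plain NumPy. The feature map, region grid, and whitening parameters below are placeholders, not the paper's actual configuration:

```python
import numpy as np

def l2n(x, eps=1e-6):
    # l2-normalize along the last axis
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def rmac(feature_map, regions, pca_mean, pca_proj):
    """Aggregate a CNN feature map of shape (C, H, W) into one R-MAC vector.

    regions:   list of (x0, y0, x1, y1) cells from the multi-scale rigid grid
    pca_mean,
    pca_proj:  whitening parameters (illustrative stand-ins for learned ones)
    """
    pooled = []
    for (x0, y0, x1, y1) in regions:
        # max-pool the activations inside each region -> one C-dim local feature
        pooled.append(feature_map[:, y0:y1, x0:x1].max(axis=(1, 2)))
    pooled = l2n(np.stack(pooled))                   # l2-normalize each region
    pooled = l2n((pooled - pca_mean) @ pca_proj.T)   # PCA-whiten, l2-normalize again
    return l2n(pooled.sum(axis=0))                   # sum-aggregate, normalize once more
```

The dot-product between two such unit-norm vectors is what the text interprets as an approximate many-to-many region matching.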
One key aspect to notice is that all these operations are differentiable. In
particular, the spatial pooling in different regions is equivalent to Region of
Interest (ROI) pooling [47], which is differentiable [48]. The PCA projection can
be implemented with a shift followed by a fully connected (FC) layer, while the
gradients of the sum-aggregation of the different regions and the $\ell_2$-normalization are
also easy to compute. Therefore, one can implement a network architecture that,
given an image and the precomputed coordinates of its regions (which depend
only on the image size), produces the final R-MAC representation in a single
forward pass. More importantly, one can backpropagate through the network ar-
chitecture to learn the optimal weights of the convolutions and the projection.
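The claim that the PCA projection reduces to a shift plus an FC layer can be checked directly; `P` and `mu` below are stand-ins for learned whitening parameters, not values from the paper:

```python
import numpy as np

np.random.seed(0)
P = np.random.randn(4, 8)   # hypothetical PCA projection (rows = components)
mu = np.random.randn(8)     # data mean (the "shift")
x = np.random.randn(8)      # a pooled region feature

# PCA whitening written as shift-then-projection
y_pca = P @ (x - mu)

# the same operation as a fully connected layer with weights W and bias b
W, b = P, -P @ mu
y_fc = W @ x + b

assert np.allclose(y_pca, y_fc)
```

Because the two forms are algebraically identical, the projection and shift become ordinary trainable parameters that gradients flow through during backpropagation.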
Learning for particular instances. We depart from previous works on fine-
tuning networks for image retrieval that optimize classification using cross-
entropy loss [17]. Instead, we consider a ranking loss based on image triplets.
Given a query, a relevant image, and a non-relevant image, it explicitly enforces
that the relevant image is closer to the query than the non-relevant one. To
do so, we use a three-stream Siamese network in which the weights of the streams
are shared (see Fig. 1, top). Note that the number and size of the weights in the
network (the convolutional filters and the shift and projection) are independent of
the size of the images, so we can feed each stream with images of different
sizes and aspect ratios.
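A hinge-based triplet ranking loss of this kind can be sketched as follows; the exact formulation and margin used for training are those specified in the text, so the margin value here is purely illustrative:

```python
import numpy as np

def triplet_margin_loss(q, d_pos, d_neg, margin=0.1):
    """Hinge-based triplet ranking loss on l2-normalized descriptors.

    Returns zero once the relevant descriptor d_pos is closer to the
    query q than d_neg by at least the margin.
    """
    d_p = np.sum((q - d_pos) ** 2)   # squared distance to the relevant image
    d_n = np.sum((q - d_neg) ** 2)   # squared distance to the non-relevant one
    return max(0.0, margin + d_p - d_n)
```

When the triplet is already correctly ranked with enough slack, the loss (and hence the gradient) vanishes, so training focuses on hard or mis-ranked triplets.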
Let $I_q$ be a query image with R-MAC descriptor $q$, $I^+$ be a relevant image
with descriptor $d^+$, and $I^-$ be a non-relevant image with descriptor $d^-$. We