Large-Scale E-Commerce Image Retrieval with
Top-Weighted Convolutional Neural Networks
Shichao Zhao¹, Youjiang Xu¹, Yahong Han¹,²
¹School of Computer Science and Technology, Tianjin University, Tianjin, China
²Tianjin Key Lab. of Cognitive Computing & Application, Tianjin University, Tianjin, China
{zhaoshichao, yjxu, yahong}@tju.edu.cn
ABSTRACT
Several recent studies have shown that image features produced by Convolutional Neural Networks (CNNs) provide state-of-the-art performance for image classification and retrieval. Moreover, some researchers have found that
the features extracted from the deep convolutional layers of
CNNs perform better than those from the fully-connected layers. Features extracted from the convolutional layers have
a natural interpretation: descriptors of local image regions
correspond well to the receptive fields of the particular fea-
tures. In order to obtain both representative and discrimina-
tive descriptors for large-scale e-commerce image retrieval,
we present a new feature extraction framework. First, we propose the Top-Weight method to automatically detect the regions of interest in e-commerce images. With the
estimated weights, we then aggregate local deep features to produce a high-quality global representation for e-commerce image retrieval. We conducted experiments on the e-
commerce dataset ALISC [1] released by Alibaba Group.
Experimental results show that our method outperforms
other deep learning based methods.
Keywords
Image Features, CNNs, Top-Weight
1. INTRODUCTION
With the rapid progress of digital technology, the number of digital images is growing explosively. This trend makes
image retrieval an important and challenging research topic
nowadays. In particular, the task of e-commerce image retrieval has bright prospects and great commercial value.
For much of the past decade, bag-of-features methods were
considered to be the state-of-the-art [9], especially when
built on top of locally invariant features like SIFT [8]. In
recent years, deep convolutional neural networks have at-
tracted much attention in visual recognition, largely due to
their good performance. It has been discovered that the ac-
tivations of CNNs pretrained on a large dataset, such as Im-
ageNet [4], can be employed as a generic image representation that adapts to many visual problems and delivers impressive performance. Initially, researchers utilized the fully-connected layers of deep networks as global image representations [3]. With the evolution of deep representations, research attention has shifted from the fully-connected layers to the deep convolutional layers of CNNs, from which local convolutional descriptors are extracted. How to aggregate a set of local descriptors into a global one has been studied extensively [5]; the best-known aggregation approaches used with SIFT are VLAD [7] and Fisher Vectors [10]. However, owing to the differences between deep convolutional features and hand-crafted features like dense SIFT, it has been shown that the preliminary embedding step is unnecessary for deep convolutional features, because of their higher discriminative ability and different distribution properties. Hence, for deep convolutional features, usually only the aggregation step is performed.
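To make this aggregation-only pipeline concrete, the following sketch (in Python with NumPy; the random array is a hypothetical stand-in for activations of a pretrained CNN) treats a convolutional feature map of shape C x H x W as H*W local C-dimensional descriptors and sum-pools them into a single global descriptor:

import numpy as np

# Hypothetical stand-in for a deep convolutional feature map of a
# pretrained CNN; shape is (channels, height, width).
C, H, W = 512, 13, 13
fmap = np.random.rand(C, H, W).astype(np.float32)

# View the map as H*W local descriptors, each C-dimensional.
descriptors = fmap.reshape(C, H * W).T        # shape: (H*W, C)

# Aggregation without any embedding step: sum-pool and L2-normalize.
global_desc = descriptors.sum(axis=0)
global_desc /= np.linalg.norm(global_desc)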
Moreover, SPoC [2] proposed a weighting method named center prior, based on the prior knowledge that the object of interest lies in the center of the image, which achieves good performance on benchmark datasets. However, this strong assumption often becomes a constraint that hurts performance in many applications. This is especially true for e-commerce images: unlike scene images, their target objects are crucial for retrieval. Thus, the crux of the problem is to locate the target object accurately.
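For reference, the center prior of SPoC [2] weights spatial positions with a 2-D Gaussian peaked at the image center before sum-pooling. The sketch below is a minimal illustration; the sigma value and the L2 normalization are our assumptions rather than the exact choices of [2]:

import numpy as np

def center_prior(H, W, sigma):
    # 2-D Gaussian weight peaking at the center of the feature map.
    ys, xs = np.mgrid[0:H, 0:W]
    d2 = (ys - (H - 1) / 2.0) ** 2 + (xs - (W - 1) / 2.0) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))   # shape: (H, W)

C, H, W = 512, 13, 13
fmap = np.random.rand(C, H, W).astype(np.float32)  # stand-in activations

# Weight each spatial position before sum-pooling across locations.
w = center_prior(H, W, sigma=H / 3.0)              # sigma: assumption
desc = (fmap * w[None, :, :]).sum(axis=(1, 2))
desc /= np.linalg.norm(desc)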
Motivated by the above discussions, we put forward a
more effective feature representation for e-commerce image
retrieval. In this framework, we propose a new concept named Top-Weight, which average-pools the top convolutional layer across channels. Figure 1 illustrates the flowchart of extracting the Top-Weight; as can be seen, the Top-Weight exhibits high correlation with the target areas, so it yields more discriminative image features. We first calculate the Top-Weight from
the top convolutional layer. Then we multiply the calculated
Top-Weight by the features extracted from various convolutional layers, ranging from the shallow to the deep layers of the CNN. Our goal is to capture both low-level and high-level
information. Finally, we aggregate a set of local convolu-
tional descriptors into a final feature representation for im-
age retrieval. We report experimental results on an e-commerce dataset sampled from ALISC; the results demonstrate the effectiveness of our method compared with other CNN-based methods.
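The following is a minimal sketch of the proposed pipeline, assuming sum-pooling as the final aggregation, nearest-neighbor resizing to align the Top-Weight with layers of different spatial resolutions, and concatenation of the per-layer descriptors; these choices, the layer shapes, and the normalizations are illustrative assumptions, and the random arrays stand in for activations of a pretrained CNN:

import numpy as np

def top_weight(top_fmap):
    # Average the top convolutional layer across channels, yielding
    # one weight per spatial position; normalization is an assumption.
    w = top_fmap.mean(axis=0)                  # shape: (H, W)
    return w / (w.sum() + 1e-12)

def resize_nn(w, H, W):
    # Nearest-neighbor resize of the weight map so it matches a layer
    # with a different spatial resolution (illustrative assumption).
    ys = np.arange(H) * w.shape[0] // H
    xs = np.arange(W) * w.shape[1] // W
    return w[np.ix_(ys, xs)]

def weighted_descriptor(fmap, w):
    # Multiply local descriptors by the Top-Weight, sum-pool over
    # spatial positions, and L2-normalize.
    C, H, W = fmap.shape
    wm = resize_nn(w, H, W)
    d = (fmap * wm[None, :, :]).sum(axis=(1, 2))
    return d / np.linalg.norm(d)

# Stand-in feature maps from a shallow, a middle, and the top layer.
shallow = np.random.rand(256, 28, 28).astype(np.float32)
middle  = np.random.rand(512, 14, 14).astype(np.float32)
top     = np.random.rand(512,  7,  7).astype(np.float32)

w = top_weight(top)
# Concatenating the per-layer descriptors captures both low-level and
# high-level information (concatenation is our assumption here).
final = np.concatenate([weighted_descriptor(f, w)
                        for f in (shallow, middle, top)])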