Learning-based hashing (or Learning-to-hash) pursues a
compact binary representation from the training data. Based
on whether side information is used or not, learning-to-hash
methods can be divided into two categories: unsupervised
methods and supervised methods.
Unsupervised methods try to learn a set of similarity-
preserving hash functions only from the unlabeled data.
Representative methods in this category include Kernelized
LSH (KLSH) [2], Semantic hashing [13], Spectral
hashing [14], Anchor Graph Hashing [3], and Iterative
Quantization (ITQ) [1]. Kernelized LSH (KLSH) [2]
generalizes LSH to accommodate arbitrary kernel functions,
making it possible to learn hash functions which preserve data
points’ similarity in a kernel space. Semantic hashing [13]
generates hash functions by a deep auto-encoder via stacking
multiple restricted Boltzmann machines (RBMs). Graph-based
hashing methods, such as Spectral hashing [14] and Anchor
Graph Hashing [3], learn non-linear mappings as hash
functions which try to preserve the similarities within the
data neighborhood graph. In order to reduce the quantization
errors, Iterative Quantization (ITQ) [1] seeks to learn an
orthogonal rotation matrix which is applied to the data matrix
after principal component analysis projections.
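This two-step recipe (PCA projection followed by a learned orthogonal rotation) is simple enough to sketch. The following toy NumPy implementation is our own illustration of the idea in [1], not the authors' code: it alternates between updating the binary codes and solving an orthogonal Procrustes problem for the rotation; the bit length and iteration count are arbitrary defaults.

```python
import numpy as np

def itq(X, n_bits=32, n_iters=50, seed=0):
    """Toy illustration of ITQ [1] (not the authors' implementation).

    X: (n_samples, n_features) data matrix. Returns codes in {-1, +1}
    (via np.sign, so exact zeros would map to 0), plus the PCA
    projection W and the learned orthogonal rotation R.
    """
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)                     # zero-center the data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:n_bits].T                          # top n_bits PCA directions
    V = X @ W                                  # projected data, (n, n_bits)
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))  # random init
    for _ in range(n_iters):
        B = np.sign(V @ R)                     # fix R, update binary codes
        # Fix B, update R: orthogonal Procrustes solved via SVD of B^T V.
        U, _, St = np.linalg.svd(B.T @ V)
        R = (U @ St).T
    return np.sign(V @ R), W, R
```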
Supervised methods aim to learn better bitwise repre-
sentations by incorporating supervised information. Notable
methods in this category include Binary Reconstruction
Embedding (BRE) [6], Minimal Loss Hashing (MLH) [15],
Supervised Hashing with Kernels (KSH) [5], Column Gen-
eration Hash (CGHash) [16], and Semi-Supervised Hashing
(SSH) [17]. Binary Reconstruction Embedding (BRE) [6]
learns hash functions by explicitly minimizing the reconstruc-
tion errors between the original distances of data points and
the Hamming distances of the corresponding binary codes.
Minimal Loss Hashing (MLH) [15] learns similarity-
preserving hash codes by minimizing a hinge-like loss func-
tion which is formulated as structured prediction with latent
variables. Supervised Hashing with Kernels (KSH) [5] is a
kernel-based supervised method which learns to hash the data
points to compact binary codes whose Hamming distances are
minimized on similar pairs and maximized on dissimilar pairs.
Column Generation Hash (CGHash) [16] is a column gen-
eration based method to learn hash functions with proximity
comparison information. Semi-Supervised Hashing (SSH) [17]
learns hash functions via minimizing similarity errors on the
labeled data while simultaneously maximizing the entropy of
the learnt hash codes over the unlabeled data. In most image
retrieval applications, the number of labeled positive samples
is small, which results in bias towards the negative samples and
over-fitting. Tao et al. [18] proposed an asymmetric bagging
and random subspace SVM (ABRS-SVM) to handle these
problems.
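To make the pairwise supervised objective concrete: since the Hamming distance between codes b_i, b_j in {-1, +1}^r equals (r - b_i^T b_j)/2, KSH [5] fits the normalized code inner products to the pairwise labels. Below is a minimal NumPy sketch of that fitting term (the function name is ours, not from [5]):

```python
import numpy as np

def ksh_fitting_term(B, S):
    """Sketch of the KSH-style objective (our naming, not from [5]).

    B: (n, r) codes in {-1, +1}; S: (n, n) pairwise labels, +1 for
    similar and -1 for dissimilar pairs. Fitting (1/r) B B^T to S is
    equivalent to pushing Hamming distances toward 0 on similar pairs
    and toward r on dissimilar pairs.
    """
    r = B.shape[1]
    return np.linalg.norm(B @ B.T / r - S, ord="fro") ** 2
```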
In supervised hashing methods for image retrieval, an emerging stream is deep-network-based methods [4], [8], [9], [19], which learn image representations as well as binary hash codes.
as well as binary hash codes. Xia et al. [4] proposed
Convolutional-Neural-Networks-based Hashing (CNNH),
which is a two-stage method. In its first stage, approximate
hash codes are learned from the supervised information.
Then, in the second stage, hash functions are learned based
on those approximate hash codes via deep convolutional
networks. Lai et al. [8] proposed a one-stage hashing
method that generates bitwise hash codes via a carefully
designed deep architecture. Zhao et al. [9] proposed a
ranking based hashing method for learning hash functions
that preserve multi-level semantic similarity between images,
via deep convolutional networks. Lin et al. [20] proposed
to learn the hash codes and image representations in a
point-wised manner, which is suitable for large-scale datasets.
Wang et al. [21] proposed Deep Multimodal Hashing with
Orthogonal Regularization (DMHOR) method for multimodal
data. All of these methods generate a single hash code for each image, which may be inappropriate for multi-label image retrieval. Different from the existing methods, the proposed method can generate multiple hash codes for an image, each code corresponding to an instance (category).
III. THE PROPOSED METHOD
Our method consists of four modules. The first module generates region proposals for an input image. The second module extracts features for the generated region proposals; it contains a deep convolution sub-network followed by a Spatial Pyramid Pooling layer [11]. The third module is a label probability calculation module, which outputs a probability matrix whose i-th row contains the probability scores of the i-th proposal belonging to each class. The fourth module is a hash coding module that first generates the instance-aware representation and then converts this representation to hash codes for either category-aware hashing or semantic hashing. In the following, we present the details of each of these modules.
A. Region Proposal Generation Module
Many methods for generating category-independent region
proposals have been proposed, e.g., Constrained Para-
metric Min-Cuts (CPMC) [22], Selective Search [23],
Multi-scale Combinatorial Grouping (MCG) [24], BInarized
Normed Gradients (BING) [25] and Geodesic Object Propos-
als (GOP) [26]. In this paper, we use GOP [26] to automatically generate region proposals for an input image. Note that other region proposal generation methods can also be used in our framework.
GOP is a method that can generate both segmentation masks and bounding box proposals. We use the code provided by the authors (http://www.philkr.net/home/gop) to generate the bounding boxes for region proposals.
B. Deep Convolution Sub-Network Module
GoogLeNet [27] is a recently proposed deep architecture
that has shown its success in object categorization and object
detection. The core of GoogLeNet is the Inception-style con-
volution module which allows increasing the depth and width
of the network while keeping reasonable computational costs.
Here we adopt the architecture of GoogLeNet as our basic
framework to compute the features for the input proposals.
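Because the proposals produced by the first module have varying sizes, the Spatial Pyramid Pooling layer [11] is what maps each proposal's feature map to a fixed-length vector. Below is a minimal PyTorch sketch of such a pooling layer, given only as an illustration (the pyramid levels are an assumption, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feat, levels=(1, 2, 4)):
    """Max-pool a (N, C, H, W) feature map over an n x n grid per level.

    The output length C * sum(n * n) is independent of H and W, so
    proposals of different sizes yield equal-length feature vectors.
    The pyramid levels here are an assumption for illustration.
    """
    n = feat.shape[0]
    pooled = [F.adaptive_max_pool2d(feat, output_size=(lv, lv)).reshape(n, -1)
              for lv in levels]
    return torch.cat(pooled, dim=1)

# Two differently sized proposal feature maps map to the same dimension:
a = spatial_pyramid_pool(torch.randn(1, 512, 13, 9))
b = spatial_pyramid_pool(torch.randn(1, 512, 7, 21))
assert a.shape == b.shape == (1, 512 * (1 + 4 + 16))
```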