Simultaneous Feature Learning and Hash Coding with Deep Neural Networks
Hanjiang Lai†, Yan Pan∗‡, Ye Liu§, and Shuicheng Yan†
†Department of Electronic and Computer Engineering, National University of Singapore, Singapore
‡School of Software, Sun Yat-Sen University, China
§School of Information Science and Technology, Sun Yat-Sen University, China
Abstract
Similarity-preserving hashing is a widely used method for
nearest neighbor search in large-scale image retrieval
tasks. In most existing hashing methods, an image is first
encoded as a vector of hand-engineered visual features,
followed by a separate projection or quantization step that
generates binary codes. However, such visual feature vectors
may not be optimally compatible with the coding process, and
thus produce sub-optimal hash codes. In this paper, we
propose a deep architecture for supervised hashing, in which
images are mapped into binary codes via carefully designed
deep neural networks. The pipeline of the proposed deep
architecture consists of three building blocks: 1) a
sub-network with a stack of convolution layers that produces
effective intermediate image features; 2) a divide-and-encode
module that divides the intermediate image features into
multiple branches, each encoded into one hash bit; and 3) a
triplet ranking loss designed to capture the constraint that
one image is more similar to a second image than to a third.
Extensive evaluations on several benchmark image datasets
show that the proposed simultaneous feature learning and hash
coding pipeline brings substantial improvements over other
state-of-the-art supervised or unsupervised hashing methods.
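As a rough illustration of building blocks 2) and 3), the following Python sketch shows one way a divide-and-encode step and a triplet ranking loss could look. The abstract only names these components, so every detail here is an assumption made for illustration: the function names, the sigmoid activation, the 0.5 threshold, the squared Euclidean distance, and the margin of 1 are not taken from the text above.

```python
import math


def divide_and_encode(features, weights, biases):
    """Split `features` into len(weights) equal slices and map each slice to
    one value in (0, 1). Assumption: one linear unit plus sigmoid per slice."""
    num_bits = len(weights)
    size = len(features) // num_bits
    if size * num_bits != len(features):
        raise ValueError("feature length must be divisible by the number of bits")
    outputs = []
    for j in range(num_bits):
        chunk = features[j * size:(j + 1) * size]
        z = sum(w * x for w, x in zip(weights[j], chunk)) + biases[j]
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs


def binarize(outputs, threshold=0.5):
    """Quantize real-valued branch outputs into hash bits (assumed threshold)."""
    return [1 if value > threshold else 0 for value in outputs]


def triplet_ranking_loss(query, similar, dissimilar, margin=1.0):
    """Hinge-style loss that is zero once the query is closer to `similar`
    than to `dissimilar` by at least `margin` (assumed squared L2 distance)."""
    d_pos = sum((a - b) ** 2 for a, b in zip(query, similar))
    d_neg = sum((a - b) ** 2 for a, b in zip(query, dissimilar))
    return max(0.0, margin + d_pos - d_neg)


# Toy usage: four intermediate features, two branches, hence two hash bits.
bits = binarize(divide_and_encode([1.0, 1.0, -1.0, -1.0],
                                  [[1.0, 1.0], [1.0, 1.0]],
                                  [0.0, 0.0]))
```

In this toy setting each branch sees only its own slice of the feature vector, which is the property the abstract attributes to the divide-and-encode module; the loss penalizes a triplet only when the ranking constraint is violated or satisfied by less than the margin.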
1. Introduction
With the ever-growing volume of image data on the Web, much
attention has been devoted to nearest neighbor search via
hashing methods. In this paper, we focus on learning-based
hashing, an emerging stream of hashing methods that learn
similarity-preserving hash functions to encode input data
points (e.g., images) into binary codes.
Many learning-based hashing methods have been proposed,
e.g., [8, 9, 4, 12, 16, 27, 14, 25, 3]. The existing
learning-based hashing methods can be categorized into
unsupervised and supervised methods, based on whether
supervised information (e.g., similarities or dissimilarities
between data points) is involved. Compact bitwise
representations are advantageous for improving efficiency in
both storage and search speed, particularly in big data
applications. Compared to unsupervised methods, supervised
methods usually embed the input data points into compact hash
codes with fewer bits, with the help of supervised
information.
∗Corresponding author: Yan Pan, email: panyan5@mail.sysu.edu.cn.
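The storage and search advantages of compact binary codes can be made concrete with a small Python sketch. It assumes each hash code is stored as a plain integer and that retrieval is a linear scan ranked by Hamming distance; the function names and the toy database are hypothetical and serve only as an illustration.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two codes differ (XOR, then popcount)."""
    return bin(a ^ b).count("1")


def nearest_neighbors(query: int, database: list, k: int = 1) -> list:
    """Return the indices of the k database codes closest to `query`.
    Ties keep database order because Python's sort is stable."""
    ranked = sorted(range(len(database)),
                    key=lambda i: hamming_distance(query, database[i]))
    return ranked[:k]


# Toy usage: four 4-bit codes; each code fits in a single machine word.
database = [0b0000, 0b1111, 0b0011, 0b0111]
closest = nearest_neighbors(0b0110, database, k=2)
```

Because comparing two codes costs one XOR and one bit count, even an exhaustive scan of this kind stays cheap relative to distance computations on real-valued feature vectors.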
In the pipelines of most existing hashing methods for
images, each input image is first represented by a vector of
traditional hand-crafted visual descriptors (e.g., GIST [18],
HOG [1]), followed by separate projection and quantization
steps that encode this vector into a binary code. However,
such fixed hand-crafted visual features may not be optimally
compatible with the coding process. In other words, a pair of
semantically similar/dissimilar images may not have feature
vectors with relatively small/large Euclidean distance.
Ideally, the image feature representation should sufficiently
preserve the image similarities, and such a representation
can be learned during the hash learning process. Very
recently, Xia et al. [27] proposed CNNH, a supervised hashing
method in which the learning process is decomposed into a
stage of learning approximate hash codes from the supervised
information, followed by a stage of simultaneously learning
hash functions and image representations based on the learned
approximate hash codes. However, in this two-stage method,
the learned approximate hash codes are used to guide the
learning of the image representation, but the learned image
representation cannot give feedback for learning better
approximate hash codes. This one-way interaction thus still
has limitations.
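The conventional pipeline described above, in which a fixed feature vector is projected and then quantized, can be sketched as follows. This is only a stand-in for illustration: it assumes random Gaussian hyperplanes for the projection and a sign test for the quantization, and it does not reproduce any of the cited methods. The names `make_projection` and `encode` are hypothetical.

```python
import random


def make_projection(num_bits, dim, seed=0):
    """Draw one random Gaussian hyperplane per hash bit (assumed projection)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def encode(feature, projection):
    """Project a fixed feature vector, then quantize each response by its sign."""
    code = []
    for row in projection:
        activation = sum(w * x for w, x in zip(row, feature))
        code.append(1 if activation >= 0 else 0)
    return code


# Toy usage: a 4-dimensional "hand-crafted" feature mapped to an 8-bit code.
projection = make_projection(num_bits=8, dim=4)
code = encode([0.2, -1.0, 0.5, 0.3], projection)
```

In a sketch like this the feature vector is fixed before coding begins, so nothing learned at the coding step can reshape the features, which is the mismatch the paragraph above points out.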
In this paper, we propose a “one-stage” supervised hashing
method via a deep architecture that maps input images to
binary codes. As shown in Figure 1, the proposed deep
architecture has three building blocks: 1) shared stacked
978-1-4673-6964-0/15/$31.00 ©2015 IEEE