Binary Generative Adversarial Networks for Image Retrieval
Jingkuan Song
ABSTRACT
The most striking successes in image retrieval using deep hashing
have mostly involved discriminative models, which require labels. In
this paper, we use binary generative adversarial networks (BGAN) to
embed images to binary codes in an unsupervised way. By restricting
the input noise variable of generative adversarial networks (GAN)
to be binary and conditioned on the features of each input image,
BGAN can simultaneously learn a binary representation per image,
and generate an image plausibly similar to the original one. In the
proposed framework, we address two main problems: 1) how to
directly generate binary codes without relaxation, and 2) how to equip
the binary representation with the ability to support accurate image retrieval.
We resolve these problems by proposing a new sign-activation strategy
and a loss function steering the learning process, which consists
of an adversarial loss, a content loss, and a neighborhood
structure loss. Experimental results on standard datasets
(CIFAR-10, NUS-WIDE, and Flickr) demonstrate that our BGAN
significantly outperforms existing hashing methods by up to 107%
in terms of mAP (see Table 3). Our anonymous code is available at:
https://github.com/htconquer/BGAN.
KEYWORDS
Generative Adversarial Networks, Hashing, Image Retrieval
ACM Reference format:
Jingkuan Song. 2017. Binary Generative Adversarial Networks for Image
Retrieval. In Proceedings of ACM Conference, Washington, DC, USA, July
2017 (Conference’17), 9 pages.
DOI: 10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
With the rapidly increasing number of images, similarity search
in large image collections has been actively pursued in a number
of domains, including computer vision, information retrieval, and
pattern recognition [39, 43]. However, exact nearest-neighbor (NN)
search is often intractable because of the size of the dataset and the high
dimensionality of images. Instead, approximate nearest-neighbor
(ANN) search is more practical and can achieve orders-of-magnitude
speed-ups compared to exact NN search [18, 39].
Recently, learning-based hashing methods [16, 17, 27, 34, 43, 45]
have become the mainstream for scalable image retrieval due to
their compact binary representation and efficient Hamming distance
calculation. Such approaches embed data points to compact binary
codes through hash functions, which can be generally expressed as:
\[ b = h(x) \in \{0, 1\}^L \tag{1} \]
where $x \in \mathbb{R}^{M \times 1}$, $h(\cdot)$ denotes the hash functions, and $b$ is a binary vector
with code length $L$.
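To make the efficiency argument concrete, the following NumPy sketch (our own toy example with a hypothetical random-projection hash, not a method from this paper) encodes a database into L-bit codes and ranks it by Hamming distance:

```python
import numpy as np

def toy_hash(x, projections):
    """Hypothetical hash h(x): project features and threshold at zero,
    yielding an L-bit binary code b in {0, 1}^L."""
    return (x @ projections > 0).astype(np.uint8)

L, M, N = 64, 512, 10000                      # code length, feature dim, database size
rng = np.random.default_rng(0)
W = rng.standard_normal((M, L))               # random projections (toy stand-in for a learned h)
database = rng.standard_normal((N, M))
query = rng.standard_normal(M)

db_codes = toy_hash(database, W)              # N x L binary codes
q_code = toy_hash(query, W)                   # L-bit query code

# Hamming distance = number of differing bits; far cheaper than
# exact NN search over the original real-valued features.
hamming = np.count_nonzero(db_codes != q_code, axis=1)
top10 = np.argsort(hamming)[:10]              # approximate nearest neighbors
```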
Hashing methods can be generally categorized as being unsuper-
vised or supervised. The unsupervised learning of a hash function
is usually based on the criterion of preserving important properties
of the training data points in the original space. Typical approaches
target pairwise similarity preservation (i.e., the similarity/distance
of binary codes should be consistent with that of the original data
points) [17, 32, 46], multi-wise similarity preservation (i.e., the simi-
larity orders over more than two items computed from the input space
and the coding space should be preserved) [33, 44], or implicit simi-
larity preservation (i.e., pursuing effective space partitioning without
explicitly evaluating the relation between the distances/similarities
in the input and coding spaces) [15, 19]. A fundamental limitation
of a hashing method geared to preserve a particular image property
is that its performance may degrade when it is applied to a context
where a different property is relevant.
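As a concrete illustration of the pairwise criterion, the PyTorch sketch below matches the similarities of (relaxed) codes to the similarities of the original data points; it is a generic illustration under assumed choices (Gaussian affinity in the input space, tanh-relaxed codes), not any specific cited method:

```python
import torch

def pairwise_preservation_loss(codes, features, sigma=1.0):
    """Generic pairwise similarity preservation: the similarity of the
    (relaxed) codes should be consistent with the similarity of the
    original data points."""
    # Similarity in the original feature space (assumed Gaussian affinity).
    d2 = torch.cdist(features, features) ** 2
    s_orig = torch.exp(-d2 / (2 * sigma ** 2))
    # Similarity of relaxed codes in (-1, 1), rescaled to [0, 1].
    s_code = (codes @ codes.t()) / codes.shape[1]
    s_code = (s_code + 1) / 2
    return torch.mean((s_code - s_orig) ** 2)

feats = torch.randn(32, 512)                  # image features
relaxed = torch.tanh(torch.randn(32, 64))     # relaxed 64-bit codes
loss = pairwise_preservation_loss(relaxed, feats)
```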
Supervised hashing is designed to generate the binary codes based
on predefined labels [8, 27, 40]. For example, Strecha et al. [40]
developed a supervised hashing method which maximizes the between-class
Hamming distance and minimizes the within-class Hamming dis-
tance. The methods in [8, 27] learn the hash codes so as to approximate
the pairwise label similarity. Supervised hashing methods usually
significantly outperform unsupervised methods. However, the infor-
mation that can be used for supervision is typically scarce.
More recently, deep learning has been introduced in the devel-
opment of hashing algorithms [4, 6, 12, 13, 28, 37, 47], leading
to a new generation of deep hashing algorithms. Due to their powerful
feature representations, remarkable image retrieval performance has
been reported using the hashes obtained in this way. However, a
number of issues still remain open. The most successful
deep hashing methods are usually supervised and require labels. The
labels are, however, scarce and subjective. Unsupervised approaches,
on the other hand, cannot take full advantage of current deep
learning models, and thus yield unsatisfactory performance [28]. An-
other issue is the non-smooth sign-activation function used to generate
the binary codes, which, despite several ideas proposed to tackle
it [2, 6, 26], still makes standard back-propagation infeasible.
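To see why, the sign function has zero gradient almost everywhere, so gradients cannot flow through it. A common workaround in prior work is a straight-through estimator, sketched below in PyTorch purely as background; it is not the sign-activation strategy proposed in this paper:

```python
import torch

class SignSTE(torch.autograd.Function):
    """sign() has zero gradient almost everywhere, so plain back-propagation
    through it is infeasible. The straight-through estimator passes the
    gradient as if sign() were the identity (clipped to |x| <= 1)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Block gradients where |x| > 1, pass them through otherwise.
        return grad_output * (x.abs() <= 1).float()

x = torch.randn(8, 64, requires_grad=True)    # relaxed activations
b = SignSTE.apply(x)                          # binary codes in {-1, +1}
b.sum().backward()                            # gradients reach x despite sign()
```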
To address the above issues, we propose an unsupervised hashing
method that deploys a generative adversarial network (GAN) [36].
GANs have proven effective at generating synthetic data similar to the
training data from a latent space. Therefore, if we restrict the input
noise variable of the GAN to be binary and conditioned on the features
of each input image, we can learn a binary representation for each
image and simultaneously generate an image plausibly similar to the
original one. Feeding the generated images through a “discriminator”
that verifies them with respect to the training images removes the
need for supervision and the