data and cross-modal retrieval task. On the other hand, instead of using Laplacian Eigenmap to solve the problem of maintaining the manifold structure, we directly utilize Locally Linear Embedding to obtain better results in preserving local neighbor information.
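To make this local-structure component concrete, the following is a minimal sketch of how Locally Linear Embedding reconstruction weights can be computed, where each sample is expressed as an affine combination of its nearest neighbors. The feature matrix X, the neighborhood size, and the regularizer are illustrative assumptions rather than the exact settings of our method.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lle_reconstruction_weights(X, n_neighbors=10, reg=1e-3):
    """Compute LLE reconstruction weights: each sample is approximated as an
    affine combination of its nearest neighbors (weights sum to one).
    X: (n_samples, n_features) feature matrix (illustrative input)."""
    n = X.shape[0]
    knn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    _, idx = knn.kneighbors(X)              # first neighbor is the point itself
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = idx[i, 1:]              # drop the point itself
        Z = X[neighbors] - X[i]             # center neighbors on the query point
        G = Z @ Z.T                         # local Gram matrix
        G += reg * np.eye(n_neighbors)      # regularize for numerical stability
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, neighbors] = w / w.sum()       # enforce the sum-to-one constraint
    return W
```

These reconstruction weights characterize the local neighborhood structure that our method aims to preserve when learning the common representation.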
2.2. Quantization in hashing
In hashing methods, obtaining binary codes requires a quantization process. In the generation of hash codes, Hypercube Quantization (Gong, Lazebnik, Gordo, & Perronnin, 2013) is commonly used, which quantizes data points into a set of vertices of a hypercube. When the quantization centers are fixed at 1 or −1, the quantization becomes Hypercube Quantization, and the problem can be written as

$\min_{\mathbf{B}} \|\mathbf{V} - \mathbf{B}\|_F^2, \quad \mathrm{s.t.}\ \ \mathbf{B} \in \{-1, 1\}^{n \times c}$  (1)

where $\mathbf{V} \in \mathbb{R}^{n \times c}$ denotes the real-valued data to be quantized and $\mathbf{B}$ denotes the corresponding binary codes. The typical methods are Iterative Quantization (Gong et al., 2013), Isotropic Hashing (Kong & Li, 2012), Harmonious Hashing (Xu et al., 2013), and Angular Quantization (Gong, Kumar, Verma, & Lazebnik, 2012). Taking Iterative Quantization as
an example, it first reduces the dimensionality of the original data by principal component analysis, and then it solves for the projection matrix with the smallest quantization error when mapping the projected data to the vertices of a hypercube. Recently, some clustering and classification methods (Nie, Tian, & Li, 2018) have introduced this kind of quantization, and hashing methods for single-modal retrieval utilize it to achieve better performance. Currently, some quantization-based methods have been proposed for cross-modal retrieval. Shared Predictive Cross-Modal Deep Quantization (Yang et al., 2018) learns the quantizer in a common subspace by semantic label alignment. Different from it, we adopt orthogonal rotation quantization on the common semantic space to reduce time consumption. In addition, cross-modal and multi-modal retrieval methods rarely take quantization algorithms into account; therefore, quantization is a worthwhile consideration in the field of cross-modal hashing.
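As a concrete illustration of the hypercube quantization in Eq. (1), the sketch below follows the alternating scheme of Iterative Quantization (Gong et al., 2013): it alternates between binarizing the rotated data and updating an orthogonal rotation via an orthogonal Procrustes step. The variable names, iteration count, and the assumption of zero-centered, PCA-reduced input are illustrative, not the exact configuration of our method.

```python
import numpy as np

def hypercube_quantization_itq(V, n_iter=50, seed=0):
    """Minimize ||B - V R||_F^2 over binary codes B in {-1, 1}^{n x c} and an
    orthogonal rotation R, in the spirit of Iterative Quantization.
    V: (n, c) zero-centered, PCA-reduced data (assumed preprocessing)."""
    rng = np.random.default_rng(seed)
    c = V.shape[1]
    R, _ = np.linalg.qr(rng.standard_normal((c, c)))   # random orthogonal init
    for _ in range(n_iter):
        B = np.where(V @ R >= 0, 1.0, -1.0)            # fix R: snap to hypercube vertices
        U, _, Wt = np.linalg.svd(V.T @ B)              # fix B: orthogonal Procrustes update
        R = U @ Wt
    B = np.where(V @ R >= 0, 1, -1)
    return B, R
```

In our method, an analogous orthogonal rotation is applied to the learned common semantic space rather than to PCA projections, which keeps the quantization step inexpensive.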
2.3. Cross-modal hashing
Cross-modal hashing methods have the advantages of satisfactory retrieval efficiency and low storage cost in dealing with large-scale data. They can also be divided into supervised and unsupervised ones based on whether label information is used during the training process. Unsupervised cross-modal hashing methods explore the intra- and inter-modal similarity to learn the hash codes. For instance, Inter-Media Hashing (IMH) (Song et al., 2013) adopts a linear regression model to learn hashing functions for each media type and introduces inter-media consistency and intra-media consistency to find a common Hamming space. Both Collective Matrix Factorization Hashing (CMFH) (Ding et al., 2014) and Cluster-based Joint Matrix Factorization Hashing (J-CMFH) (Rafailidis & Crestani, 2016) utilize a matrix factorization model to capture the latent structure between data and learn unified hash codes in a common latent space. In the unified latent space, J-CMFH learns cluster representations for cross-modal instances and captures the inter-modality and intra-modality similarities. Unsupervised Semantic-Preserving Adversarial Hashing (USePAH) (Deng et al., 2019) designs a generative adversarial framework, constructs feature similarity and neighbor similarity to guide the learning process, and learns the hash codes in an unsupervised manner. Latent Semantic Sparse Hashing (LSSH) (Zhou et al., 2014) learns the latent factors by sparse coding for the image modality and matrix factorization for the text modality, respectively. Although these methods can explore the correlation of heterogeneous data, the learned hash codes are not discriminative enough and semantic similarity is not well preserved in the Hamming space. Thus they cannot obtain further performance improvements to meet the demands of real-world applications.
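As a toy illustration of the shared latent space used by CMFH-style matrix factorization hashing (not the exact CMFH algorithm), the sketch below alternately updates modality-specific basis matrices and a shared latent representation, which is then binarized into unified hash codes. All variable names, update rules, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def collective_mf_hashing(X1, X2, code_len=32, lam=0.5, reg=1e-3, n_iter=50, seed=0):
    """Toy alternating least squares for min ||X1 - U1 V||^2 + lam * ||X2 - U2 V||^2:
    two modalities share one latent representation V, which is binarized into
    unified hash codes. X1: (d1, n), X2: (d2, n), samples stored column-wise."""
    rng = np.random.default_rng(seed)
    n = X1.shape[1]
    V = rng.standard_normal((code_len, n))
    I_c = np.eye(code_len)
    for _ in range(n_iter):
        # modality-specific bases (ridge-regularized least squares)
        U1 = X1 @ V.T @ np.linalg.inv(V @ V.T + reg * I_c)
        U2 = X2 @ V.T @ np.linalg.inv(V @ V.T + reg * I_c)
        # shared latent representation coupling both modalities
        A = U1.T @ U1 + lam * U2.T @ U2 + reg * I_c
        V = np.linalg.solve(A, U1.T @ X1 + lam * U2.T @ X2)
    # unified hash codes from the shared latent space (mean-thresholded sign)
    B = np.where(V - V.mean(axis=1, keepdims=True) >= 0, 1, -1)
    return B, U1, U2
```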
As for supervised methods, they exploit label information to train the model. Generally speaking, since they utilize semantic labels to mitigate the semantic gap and make the learned hash codes more discriminative in semantics, supervised methods can achieve better retrieval performance than unsupervised ones. A series of meaningful and representative supervised cross-modal hashing methods have been proposed. For instance, Semantic Correlation Maximization (SCM) (Zhang & Li, 2014) maximizes the semantic correlation to learn the hash codes by utilizing the semantic information. Based on CMFH (Ding et al., 2014), Supervised Matrix Factorization Hashing (SMFH) (Tang et al., 2016) maintains semantic similarity in the Hamming space by constraining the hash function with semantic label information. Semantic Preserving Hashing (SePH) (Lin et al., 2015) constructs the affinity matrix from supervised information and learns hash codes by utilizing K-L divergence to approximate the affinity matrix. Pairwise Relationship Guided Deep Hashing (PRGDH) (Yang, Deng et al., 2017) uses different pairwise constraints for inter-modal and intra-modal data and generates discriminative hash codes by an end-to-end deep network; it also uses the semantic similarity matrix in the hash code learning process. Generalized Semantic Preserving Hashing (GSePH) (Mandal, Chaudhury, & Biswas, 2017) constructs similarity matrices in single-label paired, single-label unpaired, multi-label paired, and multi-label unpaired scenarios and learns hash codes by minimizing the gap between the similarity matrix and the Hamming distances. Discrete Cross-modal Hashing (DCH) (Xu et al., 2017) treats the semantic labels as classification information and learns the hash codes in a discrete bit-wise optimization manner. These supervised methods are representative works and have achieved good results. However, it is worth noting that most of them treat semantic labels as pairwise similarities and neglect the category information.
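To illustrate this last point, the snippet below shows one common way a pairwise semantic similarity matrix is derived from (multi-)label annotations; the toy label matrix is an illustrative assumption. Two samples are marked similar whenever they share at least one label, so the richer category information in the labels is collapsed into a single bit per pair.

```python
import numpy as np

# L: (n_samples, n_categories) binary label matrix (multi-label allowed); toy example.
L = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1]])

# Pairwise similarity: S_ij = 1 if samples i and j share at least one category.
S = (L @ L.T > 0).astype(int)
# S discards how many and which categories each pair shares.
```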
3. Our method
In this section, we describe the proposed method in detail and show the whole framework in Fig. 1.