submodular hashing framework to index videos, which represented each video by the average of its individual frames. However, the temporal structure between frames is neither considered nor encoded into the binary codes, and thus the temporal information may be lost [36]. Moreover, these methods are based on hand-crafted features designed for static images, e.g., GIST [20] and SIFT [17], which are not suitable for video hashing. Inspired by the great success of deep learning, Wu et al. [36] proposed an unsupervised deep video hashing method which extracts video features through a deep neural network and then learns
the hash function in an end-to-end way. However, the proposed framework only considers the quantization error and the variance balance, a motivation very similar to that of ITQ-CCA [6]. Its extensibility to other hashing algorithms is therefore not guaranteed.
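For concreteness, an ITQ-style objective of this kind, given here only as an illustrative sketch and not as the exact loss of [36], minimizes the quantization error between the binary codes B and the rotated projected features VR,

\min_{B,R} \; \lVert B - VR \rVert_F^2 \quad \text{s.t.} \quad B \in \{-1,+1\}^{n \times c}, \; R^{\top}R = I,

where V denotes the n \times c matrix of projected (e.g., PCA-reduced) features; a bit-balance condition such as \mathbf{1}^{\top}B \approx \mathbf{0} is typically added so that each bit splits the data evenly and thus carries maximal variance.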
Recently, it has been found that visual recognition can be significantly boosted by deep neural networks [4,7,14,24,31,34,41,43]. On one hand, convolutional neural networks (CNNs) are effective in learning the spatial structure of images, and have therefore been exploited in many computer vision applications [14,24,43]. By utilizing the feature vectors generated by the seventh layer of a CNN, the method proposed in [14] achieved state-of-the-art performance in image retrieval on the ImageNet dataset [3]. On the other hand, recurrent neural networks (RNNs) are well known to be “deep in time” and are able to form implicit compositional representations in the time domain [4]. Long short-term memory (LSTM) [10], a successful variant of the RNN, has shown state-of-the-art performance in video classification and captioning [4,7,34,41]. The combination of a CNN and an LSTM can capture both the spatial features of each frame and the temporal correlations between successive frames, and has thus been utilized for many tasks. These advantages of deep neural networks inspire us to apply them to video hashing.
In this paper, we construct an unsupervised hashing framework composed of four key components: a CNN, an LSTM, a time series pooling layer, and an unsupervised hashing function learning component. To be specific, the spatial features of the video frames are extracted by the CNN, and the temporal features are modeled by the LSTM network. In order to obtain a single hash code for each video, we adopt a time series pooling strategy to aggregate the frame-level features into video-level features. After that, the obtained feature vectors are fed into the unsupervised hashing function learning component to learn the corresponding hashing functions. As a result, our method is able to fully exploit both the spatial information within each frame and the temporal relationship between different frames.
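As a rough illustration of this pipeline, the following PyTorch-style sketch chains a frame-level CNN, an LSTM, mean pooling over time, and a sign-based projection. The backbone, layer sizes, pooling choice, and the fixed projection used for hashing are assumptions made only for exposition; they are not the exact configuration or learning procedure of the proposed framework.

import torch
import torch.nn as nn

class VideoHashSketch(nn.Module):
    """Illustrative only: frame-level CNN -> LSTM -> temporal pooling -> binary code.
    All sizes and the sign-based hashing step are assumptions, not the exact
    configuration of the proposed framework."""

    def __init__(self, feat_dim=256, hidden_dim=256, code_len=64):
        super().__init__()
        # Tiny stand-in for a (pretrained) CNN that maps each frame to a feature vector.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Projection whose sign yields the code; in the actual framework this component
        # is learned in an unsupervised way rather than kept fixed.
        self.hash_proj = nn.Linear(hidden_dim, code_len, bias=False)

    def forward(self, frames):                  # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        x = self.cnn(frames.flatten(0, 1))      # (batch*time, feat_dim) spatial features
        x, _ = self.lstm(x.reshape(b, t, -1))   # (batch, time, hidden) temporal features
        video_feat = x.mean(dim=1)              # time series pooling over the frame axis
        return torch.sign(self.hash_proj(video_feat))  # video-level code in {-1, +1}

codes = VideoHashSketch()(torch.randn(2, 16, 3, 112, 112))  # two clips of 16 frames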
Compared with our previous work in [18], we utilize two pooling methods to obtain the spatio-temporal representations of videos and discuss which one yields better performance for video hashing. We also conduct more experiments to demonstrate the effectiveness of the proposed framework and compare it with state-of-the-art video hashing methods.
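Purely to make the pooling comparison concrete, the short sketch below contrasts two generic temporal pooling choices, mean and max over the time axis. These are hypothetical stand-ins for illustration only and are not necessarily the two pooling methods studied in this paper.

import torch

def temporal_pool(frame_feats, mode="mean"):
    """Pool frame-level features of shape (time, dim) into one video-level vector.
    Mean and max pooling are shown as hypothetical examples only; they are not
    necessarily the two pooling methods compared in this paper."""
    if mode == "mean":
        return frame_feats.mean(dim=0)          # average over the time axis
    if mode == "max":
        return frame_feats.max(dim=0).values    # element-wise maximum over time
    raise ValueError(f"unknown pooling mode: {mode}")

lstm_outputs = torch.randn(16, 256)             # e.g., 16 LSTM outputs of dimension 256
v_mean, v_max = temporal_pool(lstm_outputs, "mean"), temporal_pool(lstm_outputs, "max")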
The rest of our paper is organized as follows. The details of our approach are described in Sect. 2. Our approach is empirically evaluated on real-world datasets in Sect. 3. Finally, we conclude the paper in Sect. 4.
2 Methodology
The proposed unsupervised video hashing framework, shown in Fig. 1, comprises four components: a CNN, an LSTM, a time series pooling layer, and an unsupervised hashing function learning component. In this section, we first define the notations used in our formulation and then discuss the details of the framework.