submodular hashing framework to index videos, which represented each video by the average of its individual frames. However, the temporal structure between frames is neither considered nor encoded into the binary codes, and thus the temporal information may be lost [36]. Moreover, these methods are based on hand-crafted features designed for static images, e.g., GIST [20] and SIFT [17], which are not suitable for video hashing. Inspired by the great success of deep learning, Wu et al. [36] proposed an unsupervised deep video hashing method which extracts video features through a deep neural network and then learns
the hash function in an end-to-end way. However, the proposed framework only considers the quantization error and the variance balance, a motivation very similar to that of ITQ-CCA [6]. Its extensibility to other hashing algorithms is therefore not guaranteed.
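For concreteness, an ITQ-style objective of this kind, given here only as an illustrative sketch and not as the exact loss of [36], minimizes the quantization error between the binary codes B and the rotated projected features VR,

\min_{B,R} \; \lVert B - VR \rVert_F^2 \quad \text{s.t.} \quad B \in \{-1,+1\}^{n \times c}, \; R^{\top}R = I,

where V denotes the n \times c matrix of projected (e.g., PCA-reduced) features; a bit-balance condition such as \mathbf{1}^{\top}B \approx \mathbf{0} is typically added so that each bit splits the data evenly and thus carries maximal variance.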
Recently, it has been found that visual recognition can be significantly boosted by deep neural networks [4,7,14,24,31,34,41,43]. On one hand, convolutional neural networks (CNNs) are effective in learning the spatial structure of images, and have therefore been exploited in many computer vision applications [14,24,43]. By utilizing the feature vectors generated by the seventh layer of a CNN, the method proposed in [14] achieved state-of-the-art performance in image retrieval on the ImageNet dataset [3]. On the other hand, recurrent neural networks (RNNs) are well known to be “deep in time” and are able to form implicit compositional representations in the time domain [4]. Long short-term memory (LSTM) [10], a successful variant of the RNN, has shown state-of-the-art performance in video classification and captioning [4,7,34,41]. The combination of a CNN and an LSTM can capture both the spatial features of each frame and the temporal correlations between successive frames, and has thus been utilized for many tasks. These advantages of deep neural networks inspire us to apply them to video hashing.
In this paper, we construct an unsupervised hashing framework composed of four key components: a CNN, an LSTM, a time series pooling layer, and an unsupervised hashing function learning component. To be specific, the spatial features of the video frames are extracted by the CNN, and the temporal features are modeled by the LSTM network. In order to obtain a single hash code for each video, we adopt a time series pooling strategy to aggregate the frame-level features into video-level features. After that, the obtained feature vectors are fed into the unsupervised hashing function learning component to learn the corresponding hashing functions. As a result, our method is able to fully exploit both the spatial information within each frame and the temporal relationship between different frames.
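As a rough illustration of this pipeline, the following PyTorch-style sketch chains a frame-level CNN, an LSTM, mean pooling over time, and a sign-based projection. The backbone, layer sizes, pooling choice, and the fixed projection used for hashing are assumptions made only for exposition; they are not the exact configuration or learning procedure of the proposed framework.

import torch
import torch.nn as nn

class VideoHashSketch(nn.Module):
    """Illustrative only: frame-level CNN -> LSTM -> temporal pooling -> binary code.
    All sizes and the sign-based hashing step are assumptions, not the exact
    configuration of the proposed framework."""

    def __init__(self, feat_dim=256, hidden_dim=256, code_len=64):
        super().__init__()
        # Tiny stand-in for a (pretrained) CNN that maps each frame to a feature vector.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Projection whose sign yields the code; in the actual framework this component
        # is learned in an unsupervised way rather than kept fixed.
        self.hash_proj = nn.Linear(hidden_dim, code_len, bias=False)

    def forward(self, frames):                  # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        x = self.cnn(frames.flatten(0, 1))      # (batch*time, feat_dim) spatial features
        x, _ = self.lstm(x.reshape(b, t, -1))   # (batch, time, hidden) temporal features
        video_feat = x.mean(dim=1)              # time series pooling over the frame axis
        return torch.sign(self.hash_proj(video_feat))  # video-level code in {-1, +1}

codes = VideoHashSketch()(torch.randn(2, 16, 3, 112, 112))  # two clips of 16 frames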
Compared with our previous work in [18], we utilize two pooling methods to obtain the spatio-temporal representations of videos and discuss which one yields better performance for video hashing. We also conduct more experiments to demonstrate the effectiveness of the proposed framework and compare it with state-of-the-art video hashing methods.
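Purely to make the pooling comparison concrete, the short sketch below contrasts two generic temporal pooling choices, mean and max over the time axis. These are hypothetical stand-ins for illustration only and are not necessarily the two pooling methods studied in this paper.

import torch

def temporal_pool(frame_feats, mode="mean"):
    """Pool frame-level features of shape (time, dim) into one video-level vector.
    Mean and max pooling are shown as hypothetical examples only; they are not
    necessarily the two pooling methods compared in this paper."""
    if mode == "mean":
        return frame_feats.mean(dim=0)          # average over the time axis
    if mode == "max":
        return frame_feats.max(dim=0).values    # element-wise maximum over time
    raise ValueError(f"unknown pooling mode: {mode}")

lstm_outputs = torch.randn(16, 256)             # e.g., 16 LSTM outputs of dimension 256
v_mean, v_max = temporal_pool(lstm_outputs, "mean"), temporal_pool(lstm_outputs, "max")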
The rest of our paper is organized as follows. The details of our approach are described in Sect. 2. Our approach is empirically evaluated on real-world datasets in Sect. 3. Finally, we conclude the paper in Sect. 4.
2 Methodology
The proposed unsupervised video hashing framework, shown in Fig. 1, comprises four components: a CNN, an LSTM, a time series pooling layer, and an unsupervised hashing function learning component. In this section, we first define the notations used in our formulation and then discuss the details of the framework.