深度学习与SIFT：图像检索中的互补搭档

需积分: 50 97 浏览量更新于2024-08-13 收藏 1.59MB PDF 举报

本文探讨了在过去的十年里，SIFT（尺度不变特征变换）在图像检索等视觉任务中占据主导地位，而近年来，深度卷积神经网络（CNN）由于在图像分类和对象检测等任务上展现出卓越的性能，引发了关于它们在图像检索中是否可以替代SIFT的讨论。文章的核心观点是CNN和SIFT并非互相排斥，而是高度互补的。首先，作者通过实验验证了CNN和SIFT在处理图像检索时各有优势。CNN擅长捕捉图像的高层抽象特征，如物体的形状、纹理和结构，这使得它们在大规模数据集和复杂场景下表现出色。而SIFT则以其局部特征描述符和尺度不变性，在细节识别和稳定性的结合上独具一格，尤其在小样本和低光照条件下依然有效。基于这些发现，作者提出了一种名为“互补CNN与SIFT”（Complementary CNN and SIFT, CCS）的图像表示模型。该模型旨在将CNN和SIFT的优势整合到一个多层次、互补的方式中。在CCS模型中，CNN负责提取全局和高级别的特征，同时SIFT用于增强对局部特征的精确匹配，确保了对图像的全面理解和检索能力。这种方法允许在保持高精度的同时，兼顾了速度和鲁棒性，实现了对传统方法的有效补充。通过一系列严谨的实验，作者展示了CCS模型在图像检索任务中的优越性能，它不仅提高了召回率，还降低了误匹配的可能性。此外，研究还讨论了如何优化融合策略，如权重分配和特征选择，以进一步提升整体性能。这篇研究论文强调了在图像检索中，CNN和SIFT不是简单的替代关系，而是通过互补融合，可以创造出更为强大和鲁棒的图像表示和检索系统。这对于推动计算机视觉领域的研究和发展具有重要意义，特别是在寻求在实际应用中综合多种技术优势来提高性能和效率的背景下。

CNN vs. SIFT for Image Retrieval: Alternative or

Complementary?

Ke Yan

1,2

, Yaowei Wang

∗,3

, Dawei Liang

1,2

, Tiejun Huang

1,2

, Yonghong Tian

∗∗,1,2

National Engineering Laboratory for Video Technology, School of EE&CS,

Peking University, Beijing, China

Cooperative Medianet Innovation Center, China

Department of Electronic Engineering, Beijing Institute of Technology, China

{keyan, dwliang, tjhuang, yhtian}@pku.edu.cn;yaoweiwang@bit.edu.cn

ABSTRACT

In the past decade, SIFT is widely used in most vision tasks

such as image retrieval. While in recent several years, deep

convolutional neural networks (CNN) features achieve the

state-of-the-art performance in several tasks such as image

classiﬁcation and object detection. Thus a natural question

arises: for the image retrieval task, can CNN features sub-

stitute for SIFT? In this paper, we experimentally demon-

strate that the two kinds of features are highly complemen-

tary. Following this fact, we propose an image representa-

tion model, complementary CNN and SIFT (CCS), to fuse

CNN and SIFT in a multi-level and complementary way. In

particular, it can be used to simultaneously describe scene-

level, object-level and point-level contents in images. Ex-

tensive experiments are conducted on four image retrieval

benchmarks, and the experimental results show that our

CCS achieves state-of-the-art retrieval results.

Keywords

Multi-level image representation; CNN; SIFT; Complemen-

tary CNN and SIFT (CCS)

1. INTRODUCTION

Scale-invariant feature transform (SIFT) [1] has been the

most widely-used hand-crafted feature for content-based im-

age retrieval (CBIR) in the past decade. Technologically,

SIFT is intrinsically robust to geometric transformations

and shows good performance for near-duplicate image re-

trieval [2] [3]. Meanwhile, there are also many works (e.g.,

Fisher vector [4], VLAD [5] and their variants [6] [7] [8]),

that attempt to construct semantically-richer mid-level im-

age representations so as to improve the retrieval perfor-

mance. However, in spite of signiﬁcant eﬀorts, it is still

diﬃcult to fully bridge the semantic gap between such fea-

∗

Corresponding author: Yaowei Wang and Yonghong Tian.

Permission to make digital or hard copies of all or part of this work for personal or

classroom use is granted without fee provided that copies are not made or distributed

for proﬁt or commercial advantage and that copies bear this notice and the full cita-

tion on the ﬁrst page. Copyrights for components of this work owned by others than

ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-

publish, to post on servers or to redistribute to lists, requires prior speciﬁc permission

and/or a fee. Request permissions from permissions@acm.org.

MM ’16, October 15-19, 2016, Amsterdam, Netherlands

 2016 ACM. ISBN 978-1-4503-3603-1/16/10. . . $15.00

DOI: http://dx.doi.org/10.1145/2964284.2967252

Figure 1: The demonstration of our proposed com-

plementary CNN and SIFT (CCS). The CCS aggre-

gates three level contents, i.e., scene-level, object-

level and point-level representations.

ture representations and human’s understanding of an image

only with SIFT-based features.

Recently, deep convolutional neural networks (CNN) have

achieved the state-of-the-art performance in several tasks,

such as image classiﬁcation [9] [10], object detection [11] [12]

and saliency detection [13]. Compared with hand-crafted

features, CNN features learned from numerous annotated

data (e.g., ImageNet [14]) in a deep learning architecture,

carry richer high-level semantic information. Several at-

tempts in CBIR [15] [16] [17] showed that CNN features

work well for image retrieval as a scene-level representation.

Gong et al. [18] proposed an approach called Multi-scale

Orderless Pooling (MOP) to represent local information by

aggregating CNN features at three scales respectively. Kon-

da et al. [19] and Xie et.al. [20] detected object proposals

and extracted CNN features for each region at the object-

level. Besides, there are also some researchers who paid

attention to deep convolutional layers to derive representa-

tions [21] [22] [23] [24] for image retrieval. Although CNN

features achieve good performance, we can not say that C-

NN will always outperform SIFT yet. Vijay et al. [25] had

showed that no one was better than the other consistently

and the retrieval gains can be obtained by combining the

two features.

407

下载后可阅读完整内容，剩余4页未读，立即下载

weixin_38622427

粉丝: 0
资源: 951

深度学习与SIFT：图像检索中的互补搭档

SIFT SURF算法的比较

基于深度学习的SIFT图像检索算法

SIFT算法和卷积神经网络算法在图像检索领域的应用分析.docx

在监视场景中学习用于小对象检索的多视图深度功能

国科大-多媒体分析与理解-2018期末试题

SIFT与SURF算法在图像特征提取中的应用

OpenCV图像匹配的跨平台实现：从桌面到移动，无缝连接

多模态学习：视觉与语音的融合

OpenCV多目标模板匹配最新进展：探索算法创新与突破

深度学习中的多模态融合方法与案例研究

最新资源