INTELLIGENT INFORMATION PROCESSING, PART 1
Moreover, spatial pyramid matching (SPM), a traditional model for BoVW, has been successfully integrated into deep convolutional networks. Motivated by SPPnet7 and Fast R-CNN,4 we observe that the spatial information of local CNN features is very important. Therefore, we propose adding an SPM layer before the VLAD encoding layer in our framework, which we call the multiple VLAD encoding method equipped with SPM with CNN features for image classification. This new framework captures more accurate and robust local CNN features, yielding better classification performance than existing methods.
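As a rough illustration of what such an SPM layer does, the following NumPy sketch partitions an H x W x D map of local CNN activations into spatial pyramid cells; each cell's features would then be encoded separately. The pyramid levels (1x1, 2x2, 4x4) and the input shape are illustrative assumptions, not the exact configuration used in our framework.

```python
import numpy as np

def spm_regions(feature_map, levels=(1, 2, 4)):
    """Split an H x W x D map of local CNN features into spatial
    pyramid cells (assumed levels: 1x1, 2x2, 4x4). Returns a list of
    (N_cell, D) arrays, one per cell, each of which a VLAD-style
    encoder can then encode separately."""
    H, W, D = feature_map.shape
    cells = []
    for L in levels:
        # cell boundaries along each axis for an L x L grid
        hs = np.linspace(0, H, L + 1, dtype=int)
        ws = np.linspace(0, W, L + 1, dtype=int)
        for i in range(L):
            for j in range(L):
                patch = feature_map[hs[i]:hs[i + 1], ws[j]:ws[j + 1], :]
                cells.append(patch.reshape(-1, D))
    return cells

fmap = np.random.rand(14, 14, 512)   # e.g., a conv-layer activation map
cells = spm_regions(fmap)
print(len(cells))  # 1 + 4 + 16 = 21 cells
```

Concatenating the per-cell encodings preserves coarse spatial layout, which plain orderless pooling discards.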
In summary, the primary contributions of this article are the following:
• We introduce a framework called the multiple VLAD encoding method, with or without SPM, using CNN features for image classification.
• We explore the multiplicity of VLAD encoding by extending it with several kinds of encoding algorithms. We develop three coding methods: VLAD-SA, VLAD-LSA, and VLAD-LLC. We also empirically demonstrate that VLAD-SA, VLAD-LSA, and VLAD-LLC boost classification performance.
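To give a flavor of the soft-assignment idea behind a variant such as VLAD-SA, the sketch below weights each local feature's residual to every codeword by a softmax over negative squared distances, instead of a single hard nearest-codeword assignment. The smoothing parameter beta and the normalization steps are assumptions for illustration; the exact formulations of our encoders are given later in the article.

```python
import numpy as np

def vlad_soft_assign(X, C, beta=10.0):
    """Soft-assignment VLAD sketch: weight the residual of each local
    feature in X (N x D) to every codeword in C (K x D) by a softmax
    over negative squared distances (beta is an assumed parameter)."""
    residuals = X[:, None, :] - C[None, :, :]      # (N, K, D)
    d2 = (residuals ** 2).sum(-1)                  # (N, K) squared distances
    w = np.exp(-beta * d2)
    w /= w.sum(axis=1, keepdims=True)              # soft assignments per feature
    V = np.einsum('nk,nkd->kd', w, residuals)      # weighted residual sums
    desc = V.ravel()
    desc = np.sign(desc) * np.sqrt(np.abs(desc))   # power normalization
    n = np.linalg.norm(desc)
    return desc / n if n > 0 else desc             # L2 normalization

X = np.random.rand(100, 64)   # 100 local CNN features
C = np.random.rand(16, 64)    # K = 16 codewords, e.g., from k-means
print(vlad_soft_assign(X, C).shape)  # (1024,)
```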
RELATED WORK
The vast literature on image classification shows it to be a very challenging problem that has gained much attention over the years. One milestone was the use of low-level features in the BoVW model, such as SIFT, a local feature descriptor that is highly robust to geometric changes.6 BoVW, one of the classical models of the computer vision community, has proven popular and successful in image classification.5,6
BoVW originated from the bag-of-words model in natural language processing and represents an image as a collection of local features. It has been widely used in instance retrieval, scene recognition, and action recognition. Traditionally, vector quantization (hard voting), the most representative encoding method, is one key step in constructing the BoVW model. Over the past several years, feature coding has been a highly active research area, and a large variety of methods have been proposed. For example, to avoid the computationally expensive L1-norm optimization of sparse coding, Jinjun Wang and colleagues developed locality-constrained linear coding (LLC).8 For large-scale image categorization, super vector encoding methods have obtained state-of-the-art performance in several tasks; the most typical are VLAD9 and the Fisher Vector (FV).10 Because super vector encoding methods have achieved powerful performance on computer vision tasks,11 we explored VLAD encoding methods for use in our framework.
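Classic VLAD aggregates, for each codeword, the residuals of the local descriptors assigned to it. The following minimal NumPy sketch shows this hard-assignment baseline; the codebook C is assumed to come from k-means over training descriptors, and the normalization choice is one common convention rather than the only one.

```python
import numpy as np

def vlad_encode(X, C):
    """Classic VLAD: assign each local descriptor in X (N x D) to its
    nearest codeword in C (K x D) and sum the residuals per codeword,
    producing a K*D-dimensional image descriptor."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)                # hard assignment
    K, D = C.shape
    V = np.zeros((K, D))
    for k in range(K):
        members = X[nearest == k]
        if len(members):
            V[k] = (members - C[k]).sum(axis=0)  # residual sum
    desc = V.ravel()
    n = np.linalg.norm(desc)
    return desc / n if n > 0 else desc         # L2 normalization

X = np.random.rand(200, 128)  # local descriptors (e.g., CNN features)
C = np.random.rand(8, 128)    # K = 8 codebook centers from k-means
print(vlad_encode(X, C).shape)  # (1024,)
```

Because the output dimension is K*D regardless of how many local descriptors an image has, VLAD turns a variable-size set of features into a fixed-length vector suitable for a standard classifier.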
Recently, the state-of-the-art technique for image classification has been the CNN, which is increasingly used in diverse computer vision applications. Generally, a CNN architecture consists of three types of layers: convolutional, pooling, and fully connected. Many researchers have enhanced CNN architectures by changing specific components in different layers. For example, Yunchao Gong and colleagues11 presented a multiscale orderless pooling scheme (MOP-CNN), which extracts CNN activations for local patches at multiple scale levels and performs orderless VLAD pooling of these activations at each level separately. Zhun Sun and colleagues explored the relationship between the shape of the kernels that define receptive fields (RFs) in CNNs and the learned feature representations for image classification.12
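The multiscale patch extraction step of a MOP-CNN-style pipeline can be sketched as sliding windows at several scales; each patch would be fed through a CNN and the resulting activations VLAD-pooled per scale level. The window sizes and half-window strides below are illustrative assumptions, not the exact settings of Gong and colleagues.

```python
import numpy as np

def multiscale_patches(image, sizes=(256, 128, 64)):
    """Illustrative MOP-CNN-style patch extraction: slide a window of
    each scale over the image with stride = half the window size.
    Returns one list of patches per scale level."""
    H, W = image.shape[:2]
    levels = []
    for s in sizes:
        stride = s // 2
        patches = [image[y:y + s, x:x + s]
                   for y in range(0, H - s + 1, stride)
                   for x in range(0, W - s + 1, stride)]
        levels.append(patches)
    return levels

img = np.zeros((256, 256, 3))
levels = multiscale_patches(img)
print([len(p) for p in levels])  # patches per scale level
```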
Because deep CNNs are trained in a layer-by-layer manner, their intermediate activations can be extracted as robust learned features that capture higher-level image information. Therefore, CNNs have been investigated as feature extractors in numerous research areas. Ruobing Wu and colleagues13 presented a novel pipeline built on deep CNN features for harvesting discriminative visual objects and parts for scene classification. Dmitry Laptev and colleagues14 proposed a deep neural network topology that incorporates a simple-to-implement transformation-invariant pooling operator (TI-POOLING). Unfortunately, CNN features mostly focus on the salient object of
March/April 2018 www.computer.org/cise