INTELLIGENT INFORMATION PROCESSING, PART 1
Moreover, spatial pyramid matching (SPM), a traditional model for BoVW, has been successfully integrated into deep convolutional networks. Motivated by SPPnet7 and Fast R-CNN,4 we observe that the spatial information of local CNN features is very important. Therefore, we propose adding an SPM layer before the VLAD encoding layer in our framework, which we call the multiple VLAD encoding method equipped with SPM with CNN features for image classification. This new framework captures more accurate and robust local CNN features, yielding better classification performance than existing methods.
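As a rough illustration of what such an SPM layer does, the following NumPy sketch partitions an H x W x D map of local CNN activations into spatial pyramid cells; each cell's features would then be encoded separately. The pyramid levels (1x1, 2x2, 4x4) and the input shape are illustrative assumptions, not the exact configuration used in our framework.

```python
import numpy as np

def spm_regions(feature_map, levels=(1, 2, 4)):
    """Split an H x W x D map of local CNN features into spatial
    pyramid cells (assumed levels: 1x1, 2x2, 4x4). Returns a list of
    (N_cell, D) arrays, one per cell, each of which a VLAD-style
    encoder can then encode separately."""
    H, W, D = feature_map.shape
    cells = []
    for L in levels:
        # cell boundaries along each axis for an L x L grid
        hs = np.linspace(0, H, L + 1, dtype=int)
        ws = np.linspace(0, W, L + 1, dtype=int)
        for i in range(L):
            for j in range(L):
                patch = feature_map[hs[i]:hs[i + 1], ws[j]:ws[j + 1], :]
                cells.append(patch.reshape(-1, D))
    return cells

fmap = np.random.rand(14, 14, 512)   # e.g., a conv-layer activation map
cells = spm_regions(fmap)
print(len(cells))  # 1 + 4 + 16 = 21 cells
```

Concatenating the per-cell encodings preserves coarse spatial layout, which plain orderless pooling discards.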
In summary, the primary contributions of this article are the following:
• We introduce a framework called the multiple VLAD encoding method, with or without SPM, using CNN features for image classification.
• We explore the multiplicity of VLAD encoding by extending it with several kinds of encoding algorithms. We develop three coding methods: VLAD-SA, VLAD-LSA, and VLAD-LLC. We also empirically demonstrate that VLAD-SA, VLAD-LSA, and VLAD-LLC boost classification performance.
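To give a flavor of the soft-assignment idea behind a variant such as VLAD-SA, the sketch below weights each local feature's residual to every codeword by a softmax over negative squared distances, instead of a single hard nearest-codeword assignment. The smoothing parameter beta and the normalization steps are assumptions for illustration; the exact formulations of our encoders are given later in the article.

```python
import numpy as np

def vlad_soft_assign(X, C, beta=10.0):
    """Soft-assignment VLAD sketch: weight the residual of each local
    feature in X (N x D) to every codeword in C (K x D) by a softmax
    over negative squared distances (beta is an assumed parameter)."""
    residuals = X[:, None, :] - C[None, :, :]      # (N, K, D)
    d2 = (residuals ** 2).sum(-1)                  # (N, K) squared distances
    w = np.exp(-beta * d2)
    w /= w.sum(axis=1, keepdims=True)              # soft assignments per feature
    V = np.einsum('nk,nkd->kd', w, residuals)      # weighted residual sums
    desc = V.ravel()
    desc = np.sign(desc) * np.sqrt(np.abs(desc))   # power normalization
    n = np.linalg.norm(desc)
    return desc / n if n > 0 else desc             # L2 normalization

X = np.random.rand(100, 64)   # 100 local CNN features
C = np.random.rand(16, 64)    # K = 16 codewords, e.g., from k-means
print(vlad_soft_assign(X, C).shape)  # (1024,)
```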
RELATED WORK
The vast literature on image classification shows it to be a very challenging problem that has gained much attention over the years. One milestone was the use of low-level features in the BoVW model, such as SIFT, a local feature descriptor that is highly robust to geometric changes.6 BoVW, one of the classical models of the computer vision community, has proven popular and successful in image classification.5,6
BoVW originated from the bag-of-words model in natural language processing and represents an image as a collection of local features. It has been widely used in instance retrieval, scene recognition, and action recognition. Traditionally, vector quantization (hard voting), the most representative encoding method, is one key step in constructing the BoVW model. Over the past several years, feature coding has been a highly active research area, and a large variety of methods have been proposed. For example, to avoid the computationally expensive L1-norm optimization of sparse coding, Jinjun Wang and colleagues developed locality-constrained linear coding (LLC).8 For large-scale image categorization, super vector encoding methods have obtained state-of-the-art performance in several tasks; the most typical are VLAD9 and the Fisher Vector (FV).10 Because super vector encoding methods have achieved powerful performance on computer vision tasks,11 we explored VLAD encoding methods for use in our framework.
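Classic VLAD aggregates, for each codeword, the residuals of the local descriptors assigned to it. The following minimal NumPy sketch shows this hard-assignment baseline; the codebook C is assumed to come from k-means over training descriptors, and the normalization choice is one common convention rather than the only one.

```python
import numpy as np

def vlad_encode(X, C):
    """Classic VLAD: assign each local descriptor in X (N x D) to its
    nearest codeword in C (K x D) and sum the residuals per codeword,
    producing a K*D-dimensional image descriptor."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)                # hard assignment
    K, D = C.shape
    V = np.zeros((K, D))
    for k in range(K):
        members = X[nearest == k]
        if len(members):
            V[k] = (members - C[k]).sum(axis=0)  # residual sum
    desc = V.ravel()
    n = np.linalg.norm(desc)
    return desc / n if n > 0 else desc         # L2 normalization

X = np.random.rand(200, 128)  # local descriptors (e.g., CNN features)
C = np.random.rand(8, 128)    # K = 8 codebook centers from k-means
print(vlad_encode(X, C).shape)  # (1024,)
```

Because the output dimension is K*D regardless of how many local descriptors an image has, VLAD turns a variable-size set of features into a fixed-length vector suitable for a standard classifier.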
Recently, the state-of-the-art technique for image classification has been the CNN, which is increasingly used in diverse computer vision applications. Generally, a CNN architecture consists of three types of layers: convolutional, pooling, and fully connected. Many researchers have enhanced CNN architectures by changing specific components in different layers. For example, Yunchao Gong and colleagues11 presented a multiscale orderless pooling scheme (MOP-CNN), which extracts CNN activations for local patches at multiple scale levels and performs orderless VLAD pooling of these activations at each level separately. Zhun Sun and colleagues explored the relationship between the shape of the kernels that define receptive fields (RFs) in CNNs and the learned feature representations for image classification.12
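The multiscale patch extraction step of a MOP-CNN-style pipeline can be sketched as sliding windows at several scales; each patch would be fed through a CNN and the resulting activations VLAD-pooled per scale level. The window sizes and half-window strides below are illustrative assumptions, not the exact settings of Gong and colleagues.

```python
import numpy as np

def multiscale_patches(image, sizes=(256, 128, 64)):
    """Illustrative MOP-CNN-style patch extraction: slide a window of
    each scale over the image with stride = half the window size.
    Returns one list of patches per scale level."""
    H, W = image.shape[:2]
    levels = []
    for s in sizes:
        stride = s // 2
        patches = [image[y:y + s, x:x + s]
                   for y in range(0, H - s + 1, stride)
                   for x in range(0, W - s + 1, stride)]
        levels.append(patches)
    return levels

img = np.zeros((256, 256, 3))
levels = multiscale_patches(img)
print([len(p) for p in levels])  # patches per scale level
```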
Because deep CNNs are trained in a layer-by-layer manner, their intermediate activations can be extracted as robust learned features that capture higher-level image information. Therefore, CNNs have been investigated as feature extractors in numerous research areas. Ruobing Wu and colleagues13 presented a novel pipeline built on deep CNN features for harvesting discriminative visual objects and parts for scene classification. Dmitry Laptev and colleagues14 proposed a deep neural network topology that incorporates a simple-to-implement transformation-invariant pooling operator (TI-POOLING). Unfortunately, CNN features mostly focus on the salient object of
March/April 2018 www.computer.org/cise