Image Classification by Exploiting the Spatial Context Information

Song Yan¹, Dai Li-Rong¹, Yu Li²
¹Department of Electronic Engineering, University of Science and Technology of China, China
²Hefei TV Station
{songy,lrdai}@ustc.edu.cn, yuulii@163.com
Abstract
Finding an effective image representation is an important problem for classification. Previous approaches have demonstrated the utility of the bag-of-features (BoF) model. These methods are attractive for their computational efficiency and conceptual simplicity. However, this efficiency is achieved by discarding the spatial context information. Furthermore, the hard quantization of local features may introduce quantization error. To address these issues, we propose an effective image representation that exploits the spatial context information. Specifically, the visual codebook is constructed on pairwise descriptors lying in spatial neighborhoods, which captures the near-context information, and the spatial pyramid structure is further combined to capture the far-context information. Then, for image classification, an effective soft quantization method is proposed, which accurately represents the original features by regression on the neighboring visual words. To evaluate the effectiveness of the proposed method, we compared it with existing BoF representations on the Scenes-15 and Caltech 101 benchmark datasets for image classification. The experimental results demonstrate the superiority of the proposed method compared with state-of-the-art methods.
1. Introduction
Image classification is an important and challenging task in the computer vision community. The major difficulty lies in finding an effective image representation that can handle the large intra-class variations, such as changes in viewpoint, visibility, illumination, and background clutter, in addition to the inter-class variability [1].
This work is supported by the National Natural Science Foundation of China (NSFC, Grant No. 61172158) and the Anhui Provincial Natural Science Foundation (Grant No. 090412056).

Previous approaches have demonstrated promising results with representations based on local descriptors, such as SIFT [2] and HoG [3]. The idea is to describe an image by the bag-of-features (BoF) representation, in the spirit of the bag-of-words models used in text analysis [4, 5]. Specifically, the visual codebook is first constructed offline by an unsupervised clustering algorithm (e.g., k-means). The resulting cluster centroids are usually referred to as visual words. By assigning each local feature to its nearest visual word and counting the occurrences of each word, a new image is represented as a fixed-length histogram vector.
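The following is a minimal sketch of this standard BoF pipeline (k-means codebook, hard assignment, histogram counting); the function names and parameter values are illustrative assumptions rather than the exact settings used in this paper.

# Minimal sketch of the standard BoF pipeline: k-means codebook construction,
# hard assignment to the nearest visual word, and a fixed-length histogram.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_descriptors, num_words=1024, seed=0):
    """Cluster pooled local descriptors; the centroids act as visual words."""
    kmeans = KMeans(n_clusters=num_words, random_state=seed, n_init=10)
    kmeans.fit(all_descriptors)            # all_descriptors: (N, d) array
    return kmeans.cluster_centers_         # (num_words, d)

def bof_histogram(descriptors, codebook):
    """Hard-quantize each descriptor to its nearest word and count occurrences."""
    # Squared Euclidean distances between descriptors and all visual words.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)     # L1-normalized histogram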
The BoF model is attractive for its computational efficiency and conceptual simplicity. However, this efficiency is achieved by treating the image as an orderless collection of visual words. Some recent works, such as spatial pyramids [6], visual synsets [7], and high-order spatial features [8], show that capturing some degree of spatial context information can improve performance over the pure BoF model. Generally, these methods assume that the visual codebook has already been learned, and the local features are approximated by their nearest visual words before the high-order spatial context information is considered. The number of visual word combinations grows nearly quadratically with the size of the visual codebook, which results in a high-dimensional image representation. When only a few training images are available, the classifier may over-fit and fail to generalize to the test set.
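As a rough illustration of how the spatial pyramid of [6] injects spatial layout into the BoF histogram, the sketch below concatenates per-cell histograms over increasingly fine grids; the choice of grid levels and the omission of per-level weights are simplifying assumptions, not the exact setting of [6].

# Spatial pyramid pooling sketch: split the image into 1x1, 2x2, 4x4 grids
# and concatenate the per-cell BoF histograms into one long vector.
import numpy as np

def spatial_pyramid(keypoints_xy, word_ids, image_w, image_h,
                    num_words, levels=(1, 2, 4)):
    parts = []
    for g in levels:                                   # g x g grid per level
        # Assign each keypoint to its grid cell at this pyramid level.
        col = np.minimum((keypoints_xy[:, 0] * g / image_w).astype(int), g - 1)
        row = np.minimum((keypoints_xy[:, 1] * g / image_h).astype(int), g - 1)
        for cy in range(g):
            for cx in range(g):
                in_cell = (col == cx) & (row == cy)
                hist = np.bincount(word_ids[in_cell], minlength=num_words)
                parts.append(hist.astype(float))
    return np.concatenate(parts)        # length = num_words * sum of g*g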
Furthermore, it is known that traditional visual words may suffer from quantization error (i.e., the difference between the original features and their assigned visual words) [9]. The features with large quantization error lie near the boundaries between visual words. Features that should be matched with each other may thus be assigned to different visual words after quantization, leading to mismatches. This mismatch may be magnified when visual words are combined. To reduce the quantization error, several soft quantization based methods have been proposed recently [13, 14], which aim at representing the original features by their K nearest visual words.
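A simple sketch of this kind of soft quantization, in the spirit of [13, 14], is given below: each descriptor contributes to its K nearest visual words with distance-based weights instead of a single hard assignment. The Gaussian weighting and the parameter beta are assumptions for illustration, not the exact schemes of those works.

# Soft-assignment histogram: spread each descriptor over its K nearest
# visual words with weights that decay with squared distance.
import numpy as np

def soft_assign_histogram(descriptors, codebook, K=5, beta=1e-4):
    hist = np.zeros(len(codebook))
    for x in descriptors:
        d2 = ((codebook - x) ** 2).sum(axis=1)   # distances to all words
        nn = np.argsort(d2)[:K]                  # K nearest visual words
        w = np.exp(-beta * d2[nn])               # closer words weigh more
        hist[nn] += w / w.sum()                  # each feature contributes 1
    return hist / max(hist.sum(), 1.0)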
In this paper, we propose an effective image representation that exploits the spatial context information to address these problems. Firstly, the spatial context visual codebook is constructed based on pairwise descriptors lying in spatial neighborhoods,