978-1-4673-0311-8/12/$31.00 ©2012 IEEE
Novel Spatial Pyramid Matching for Scene
and Object Classification
Kai Ding, Weihai Chen, Xingming Wu, Zhong Liu
School of Automation Science and Electrical Engineering,
Beijing University of Aeronautics and Astronautics
Beijing, P.R. China.
838383_dingkai@163.com, whchenbuaa@126.com, wuxingming307@126.com, liuzhong@buaa.edu.cn
Abstract—It is difficult to classify object or scene images with
high accuracy when the dataset is relatively large. Spatial
Pyramid Matching (SPM) was proposed to deal with this
problem, but it has some shortcomings. To improve SPM, we
propose three modifications: first, we use an approximate
nearest neighbor method instead of k-means for clustering;
second, we regulate the size of the codebook according to the
number and pixel size of the images, by computing a sub-codebook
for every category and eliminating the codes that are nearer to
the registered ones than a threshold; third, we rescale the
histogram features and classify the scene with a hierarchical
strategy. Experiments show that our approach performs better
than other state-of-the-art classification methods that use
just one matching kernel.
Keywords—Object Classification, Spatial Pyramid Matching,
ANN clustering.
I. INTRODUCTION
Scene and object classification is a high-level semantic
analysis task in computer vision, and it remains a great
challenge in the field, especially scene classification.
Suppose we take a picture of a square: it may contain some
people, some buildings and some plants, but we simply call it a
square. That is a ‘scene’. If we deal with a dataset containing
several kinds of scenes and try to identify, by machine
learning, which category a picture belongs to, that is ‘scene
category classification’. With this method we can get an
approximate prediction of the category of an image while
ignoring many of its details. The method is analogous to a
person staring at a faraway natural scene and trying to tell
what he is looking at. This is also where the inspiration for
Gist, proposed by Torralba and Oliva [1], came from. Object and
scene classification algorithms now play an increasingly
important role in artificial intelligence systems such as
autonomous mobility, cargo sorting, transportation monitoring
and household robotics; other applications can be found in
augmented simulation and data compression techniques.
A wide range of algorithms has been proposed to tackle
this problem. Space subdivision and histogram methods are
representative approaches of the early phase. The features they
adopted to identify objects were color, edges and patches, which
are sensitive to illumination, scaling and affine distortion, so
classification accuracy was stuck at a low level. Then local
descriptors with illumination or scale invariance were
proposed, such as Harris and SIFT points [9]. These features
led to a boom in multi-image processing research. Some
notable advances emerged, for example the work of L. Fei-Fei
and K. Grauman. L. Fei-Fei developed a bag-of-words (BoW)
method for scene classification and object recognition; here
‘feature’ means dense SIFT, which performed better than the
plain SIFT feature in her comparative evaluation [2]. Her work
had a strong influence on subsequent studies, such as KNN-SVM
by H. Zhang, Spatial Pyramid Matching (SPM) by S. Lazebnik and
Spatial Pyramid Kernels by A. Bosch [4][5][6]. K. Grauman
proposed the Pyramid Matching method, which is also based on
SIFT features and produces a histogram for each image using
weighted intersection at multiple resolutions [3]. But neither
of them took full advantage of position information; what they
focus on is the probability that each matching feature appears
in the unlabeled images. How important is position information
in image classification? S. Lazebnik proposed the spatial
pyramid matching algorithm, in which simple position
information is preserved by creating ordered regular-grid
feature vectors; experiments showed a remarkable increase in
classification accuracy compared with methods without position
information [5]. However, the role of position information is
subtle: if too much is preserved, the desirable property of
overlooking details disappears and accuracy may fall, so
position information can only be an auxiliary factor here.
Recently, multi-kernel matching algorithms, which combine SIFT
with Pyramid Histogram of Oriented Gradients and other
features, have become popular, and their accuracy is higher
[7]. In this paper, however, we only discuss image
classification algorithms with a single kernel.
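As an illustration of the regular-grid representation described
above, the following Python sketch builds a spatial pyramid
histogram in the spirit of SPM [5] and compares two such
histograms by intersection. It is a minimal sketch and not the
implementation used in this paper: the function names, the
argument layout and the assumption that every feature has already
been assigned a visual-word index are ours, and the level weights
are the ones commonly quoted for SPM.

```python
def spatial_pyramid_histogram(points, words, vocab_size,
                              width, height, levels=2):
    """Concatenate weighted per-cell word histograms for levels 0..levels.

    points: list of (x, y) feature positions inside the image
    words:  visual-word index of each feature, in range(vocab_size)
    (All names here are illustrative assumptions.)
    """
    feature = []
    for level in range(levels + 1):
        cells = 2 ** level  # number of grid cells along each axis
        # Assumed SPM weighting: level 0 gets 1 / 2^L,
        # level l >= 1 gets 1 / 2^(L - l + 1).
        if level == 0:
            weight = 1.0 / 2 ** levels
        else:
            weight = 1.0 / 2 ** (levels - level + 1)
        hist = [0.0] * (cells * cells * vocab_size)
        for (x, y), word in zip(points, words):
            # Grid cell containing the feature, clamped to the last cell.
            cx = min(int(x * cells / width), cells - 1)
            cy = min(int(y * cells / height), cells - 1)
            hist[(cy * cells + cx) * vocab_size + word] += weight
        feature.extend(hist)
    return feature


def histogram_intersection(a, b):
    """Histogram intersection: sum of element-wise minima."""
    return sum(min(u, v) for u, v in zip(a, b))
```

With a vocabulary of M words and levels 0 to 2, the feature has
M(1 + 4 + 16) = 21M entries. Adding finer levels preserves more
position information, which is the trade-off discussed above.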
Because of its impressive performance, the spatial pyramid
approach has been introduced into other image processing
methods. But when we studied this algorithm, we found some
shortcomings in its support for large datasets and in its image
representation. First, it uses k-means clustering in the spatial
pyramid, which can cluster unevenly distributed data
incorrectly. Second, it generates the codebook by processing
all the image features together, so some minority centers may
be absorbed by larger ones, which can hinder recognition of the
related scenes. Third, the resulting histograms should be
preprocessed before training, because this can improve
classification accuracy, but the authors did not mention it.
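The second shortcoming is what the per-category sub-codebooks
mentioned in the abstract address: codes are registered category
by category, and a code is eliminated when it is nearer to an
already registered code than a threshold. The Python sketch below
shows one possible reading of that step. The function name, the
Euclidean distance and the registration order are assumptions made
for illustration, and the clustering that produces each
sub-codebook is not shown.

```python
import math


def merge_sub_codebooks(sub_codebooks, threshold):
    """Merge per-category sub-codebooks into one codebook.

    sub_codebooks: one list of code vectors (tuples) per category
    threshold:     minimum allowed distance between registered codes
    (Hypothetical helper; Euclidean distance is an assumption.)
    """
    codebook = []
    for codes in sub_codebooks:
        for code in codes:
            # Register the code only if no registered code is nearer
            # than the threshold; otherwise it is eliminated.
            if all(math.dist(code, kept) >= threshold for kept in codebook):
                codebook.append(code)
    return codebook
```

Because the centers of every category are considered separately
before merging, a category with few features still contributes
its own codes unless they nearly duplicate registered ones.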
In this paper we propose a novel spatial pyramid approach,
using an approximate nearest neighbor method to cluster data