ASK THE DICTIONARY: SOFT-ASSIGNMENT LOCATION-ORIENTATION POOLING FOR
IMAGE CLASSIFICATION
Qilong Wang¹, Xiaona Deng¹, Peihua Li¹, Lei Zhang²
¹Dalian University of Technology, ²The Hong Kong Polytechnic University
ABSTRACT
The pooling step is one of the key components of the well-known
Bag-of-visual words (BoW) model, which is widely used in image
classification. In this paper, we propose a novel pooling method
called Soft-Assignment Location-Orientation Pooling (SALOP). Inspired
by the bag of statistical sampling analysis (Bossa), SALOP also
explores the effect of the dictionary on the pooling method, but
leverages both location and orientation information between the local
descriptors and the atoms of the dictionary to aggregate feature codes.
Moreover, unlike existing pooling methods, SALOP employs a
soft-assignment pooling scheme to handle the ambiguity and uncertainty
that arise in the pooling process. The evaluation is conducted on two
image benchmarks: Scene15 and PASCAL VOC 2007. The experimental
results show that SALOP achieves promising performance.
Index Terms— Image classification, Bag-of-visual words,
soft-assignment, location-orientation pooling, dictionary
1. INTRODUCTION
The Bag-of-visual words (BoW) model [20] has been successfully applied
to various image and vision tasks, especially image classification. As
illustrated in Fig. 1, the BoW model involves local feature extraction,
dictionary learning (e.g. k-means), feature encoding, pooling, and
classification. Among them, the pooling process aggregates feature
codes into a final image representation; it has great influence on
classification performance [9] and has attracted a lot of attention in
recent years. The lines of research on pooling algorithms can be
roughly divided into four groups based on the aforementioned
components of the BoW model.
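To make the stages listed above concrete, here is a minimal sketch of a BoW front end in Python with NumPy. It is an illustration under simplifying assumptions, not the configuration used in this paper: the descriptors are random stand-ins for real local features, the dictionary is learned with a few iterations of plain k-means, the encoding is hard assignment (rather than the localized soft-assignment coding shown in Fig. 1), the pooling is a simple average, and the classification stage is omitted. All function and variable names are hypothetical.

```python
import numpy as np

def learn_dictionary(descriptors, num_atoms, num_iters=10, seed=0):
    """Plain k-means (Lloyd's algorithm); returns num_atoms centroids (atoms)."""
    rng = np.random.default_rng(seed)
    init = rng.choice(len(descriptors), size=num_atoms, replace=False)
    atoms = descriptors[init].copy()
    for _ in range(num_iters):
        # Squared Euclidean distance from every descriptor to every atom.
        dists = ((descriptors[:, None, :] - atoms[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)
        for k in range(num_atoms):
            members = descriptors[nearest == k]
            if len(members) > 0:  # keep the old atom if its cluster is empty
                atoms[k] = members.mean(axis=0)
    return atoms

def encode_hard(descriptors, atoms):
    """Hard-assignment coding: a one-hot code on the nearest atom."""
    dists = ((descriptors[:, None, :] - atoms[None, :, :]) ** 2).sum(axis=2)
    codes = np.zeros((len(descriptors), len(atoms)))
    codes[np.arange(len(descriptors)), dists.argmin(axis=1)] = 1.0
    return codes

def average_pool(codes):
    """Average pooling: one image-level vector from all feature codes."""
    return codes.mean(axis=0)

# Random stand-ins for local descriptors (assumption, for illustration only).
rng = np.random.default_rng(0)
train_descriptors = rng.normal(size=(500, 16))
image_descriptors = rng.normal(size=(40, 16))

atoms = learn_dictionary(train_descriptors, num_atoms=8)
image_vector = average_pool(encode_hard(image_descriptors, atoms))
```

The resulting `image_vector` has one entry per atom and would be passed to a classifier such as an SVM. The pooling methods reviewed in this section replace `average_pool` with more informative aggregation rules.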
The work was supported by the National Natural Science Foundation of
China (61471082, 61405022).
[Figure 1 (image not reproduced). Pipeline stages shown: local feature
extraction; dictionary learning (e.g. k-means); feature encoding
(localized soft-assignment coding); pooling step, with panels
(a) SALOP and (b) BOSSANOVA; image classification (e.g. SVM).]
Fig. 1. The classification paradigm used in this paper. Our main
contribution is a novel pooling method, SALOP, shown in (a). For an
input descriptor, we compute its location and orientation with respect
to neighboring atoms of the dictionary, and perform pooling operations
based on the joint distribution of location and orientation. (b) shows
the methodology of BossaNova [1], in which only location information
is employed.
Feature level based methods design pooling operations from the point
of view of the local feature space. Boureau et al. [4] proposed a
pooling method that aggregates feature codes based on a partition of
the feature space determined by a clustering algorithm: codes from
similar local features are put together, and the pooling operation is
performed on the codes in each cluster of local features. Following
[4], Fanello et al. [19] divided the feature space into 𝐾 bins (𝐾 is
the number of classes), in which a weighted max pooling was performed.
The weight in each bin is provided by the score of an SVM classifier
learned on training features.
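As a rough illustration of the feature-level idea attributed to Boureau et al. [4] above, the following sketch assigns each local descriptor to its nearest feature-space bin and max-pools the codes within each bin separately. It is only a sketch of the general principle, not the method of [4] or the weighted variant of [19]: the bin centers are assumed to be given (e.g. from clustering), the codes are assumed nonnegative, and the name `feature_space_max_pool` is hypothetical.

```python
import numpy as np

def feature_space_max_pool(descriptors, codes, bin_centers):
    """Max-pool codes within each feature-space bin, then concatenate."""
    # Assign every descriptor to its nearest bin center (squared Euclidean).
    dists = ((descriptors[:, None, :] - bin_centers[None, :, :]) ** 2).sum(axis=2)
    bin_index = dists.argmin(axis=1)
    num_bins, code_dim = len(bin_centers), codes.shape[1]
    pooled = np.zeros((num_bins, code_dim))  # empty bins stay at zero
    for b in range(num_bins):
        members = codes[bin_index == b]
        if len(members) > 0:
            pooled[b] = members.max(axis=0)
    return pooled.reshape(-1)

# Toy data (assumption): three 2-D descriptors with 3-D nonnegative codes.
descriptors = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
codes = np.array([[1.0, 0.0, 0.2], [0.3, 0.5, 0.0], [0.0, 0.0, 0.9]])
bin_centers = np.array([[0.0, 0.0], [5.0, 5.0]])

pooled = feature_space_max_pool(descriptors, codes, bin_centers)
```

In this toy example the first two descriptors share a bin, so their codes are pooled together, while the third descriptor is pooled on its own. The output length is the number of bins times the code dimension.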
Code level based methods build on two basic and widely used pooling
operations, average and max pooling, and many works have been proposed
to improve them. [3, 7] employ an ℓ_𝑝 norm to make a trade-off between
average and max pooling, and the parameter 𝑝 can be learned from the
training data [7]. An effective alternative, the improved (@n) pooling
[12], can also be seen as a trade-off between average and max pooling,
but is more flexible and robust. Refer to [12] for more details and a
comparison of code-level pooling methods.
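The trade-off between average and max pooling can be seen in a short sketch. It assumes one common normalized form of ℓ_𝑝 pooling for nonnegative codes, f = ((1/N) Σ_i u_i^p)^(1/p), which equals average pooling at p = 1 and approaches max pooling as p grows. The exact formulations in [3, 7] may differ, and the function name `lp_pool` is hypothetical.

```python
import numpy as np

def lp_pool(codes, p):
    """Normalized l_p pooling over the rows of a nonnegative code matrix."""
    codes = np.asarray(codes, dtype=float)
    return np.mean(codes ** p, axis=0) ** (1.0 / p)

# Toy nonnegative codes (assumption): 3 local features, 2 dictionary atoms.
codes = np.array([[0.2, 0.0],
                  [0.8, 0.5],
                  [0.5, 0.1]])

avg_pooled = codes.mean(axis=0)   # average pooling
max_pooled = codes.max(axis=0)    # max pooling
p1_pooled = lp_pool(codes, 1)     # identical to average pooling
p2_pooled = lp_pool(codes, 2)     # lies between average and max
p200_pooled = lp_pool(codes, 200) # close to max pooling
```

Intermediate values of p give a pooled vector that lies between the average and the maximum in every dimension, which is why p can be treated as a tunable or learnable parameter.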
Image level pooling methods can be traced back to spatial pyramid
matching (SPM) [14], which introduces spatial information into the BoW
model by dividing the image into regular parts. [7, 23] learned
weighted pooling to exploit the geometric structure of the image.
Along this line, Cao et al. [5] proposed