Maximal Margin Feature Mapping via Basic Image
Descriptors for Image Classification
Changchen Zhao (1,2), Chun-Liang Lin (1), Weihai Chen (2)
(1) Department of Electrical Engineering, National Chung Hsing University, Taichung 40227, Taiwan
(2) School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China
chunlin@dragon.nchu.edu.tw
Abstract—Computer vision is one of the most important branches of modern industrial technology. Image classification plays an important role in computer vision, as it draws on the most advanced techniques in this area. However, most image classification methods use only the SIFT feature for further processing, which prevents rich and useful low-level image attributes from being captured. This paper proposes a maximal margin feature mapping framework that incorporates basic descriptors into the recognition system. This is achieved by optimizing an objective function that minimizes intra-class distance and reconstruction error while maximizing inter-class distance. An efficient optimization algorithm is proposed to learn the transformation matrix. Experiments are conducted on three publicly available datasets, and the preliminary results show the effectiveness of the proposed approach.
Index Terms—image classification, maximal margin, feature
mapping, non-convex optimization
I. INTRODUCTION
Recent advances in computer vision have benefited industrial technology in areas such as vision-based manipulation [1], video surveillance [2], and industrial imaging [3]. Image classification has been an active computer vision research area over the past few decades, and it is an important application of machine learning and pattern recognition. It combines techniques such as feature extraction, feature encoding, and classifier learning. First, given an input image, various features are extracted to capture basic image attributes such as color, gradient, and intensity. Then, a feature encoding method is employed to generate the image-level representation, which should be as discriminative as possible. Finally, a classifier is trained to assign a category label to a new input image. A minimal sketch of this three-stage pipeline is given below.
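The following Python sketch illustrates this three-stage pipeline under simplifying assumptions: the local descriptors are random placeholders standing in for SIFT-like features, and the codebook size and classifier are illustrative choices rather than the settings used in this paper.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def encode(descriptors, codebook):
    # Bag-of-words encoding: histogram of nearest-codeword assignments.
    words = codebook.predict(descriptors)
    hist, _ = np.histogram(words, bins=np.arange(codebook.n_clusters + 1))
    return hist / max(hist.sum(), 1)  # L1-normalized histogram

# Step 1: extract local descriptors per image (random 128-D placeholders).
train_descs = [np.random.rand(200, 128) for _ in range(10)]
train_labels = np.arange(10) % 2

# Step 2: learn a codebook of basic patterns, then encode each image.
codebook = KMeans(n_clusters=64, n_init=10).fit(np.vstack(train_descs))
X = np.array([encode(d, codebook) for d in train_descs])

# Step 3: train a classifier on the image-level representations.
clf = LinearSVC().fit(X, train_labels)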
Image features play an important role among the aforementioned techniques. A variety of image features have been proposed to capture low-level image attributes such as intensity, illumination, color, and gradient. The scale-invariant feature transform (SIFT) [4] is a well-known local feature used to detect and describe salient points in images. SIFT can robustly identify key points of an object even among clutter and under partial occlusion, because the SIFT descriptor is invariant to uniform scaling and orientation, and partially invariant to affine distortion and illumination changes. It has been widely used in image retrieval [5], object recognition [6], visual tracking [7], and, of course, image classification.
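As a concrete illustration of SIFT in practice, the OpenCV snippet below detects key points and computes their 128-dimensional descriptors; the image path is a placeholder, and opencv-python 4.4 or later is assumed, where SIFT is part of the main module.

import cv2

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# Each keypoint carries a location, scale, and orientation; each descriptor
# is a 128-dimensional histogram of local gradient orientations.
print(len(keypoints), descriptors.shape)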
Recent works employ dense SIFT as the image feature. Unlike the original SIFT descriptor, dense SIFT is extracted on image patches sampled over a dense regular grid. These algorithms use dense SIFT to capture low-level features of an object for further processing, under the assumption that SIFT features form the preliminary image patterns. These patterns are partitioned into several clusters (usually by K-means clustering), and the cluster centroids are regarded as the basic patterns of images; together they form a dictionary. For a given image, several patterns are activated via a feature encoding method. Popular encoding methods include vector quantization [8], kernel codebook encoding [9], locality-constrained linear coding (LLC) [10], Fisher encoding [11], and supervector encoding [12]. The image representation is generated from these activated patterns, and the manner in which patterns are activated determines the discriminative power of the representation. Hence, image classification relies on the discriminative power of these image representations.
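To make the distinction from standard SIFT concrete, the sketch below computes dense SIFT by placing key points on a regular grid and describing each cell, rather than running the detector; the grid step and patch size are illustrative assumptions.

import cv2

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path
step, size = 8, 16  # grid stride and patch scale, chosen for illustration
grid = [cv2.KeyPoint(float(x), float(y), float(size))
        for y in range(0, img.shape[0], step)
        for x in range(0, img.shape[1], step)]
sift = cv2.SIFT_create()
_, dense_descs = sift.compute(img, grid)  # one 128-D descriptor per grid cell
# dense_descs can then be quantized against the dictionary to activate patterns.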
However, the SIFT feature has its own limitations. First, dense SIFT does not necessarily capture image patterns as rich as those captured by other low-level features. SIFT is a key-point descriptor, typically anchored at edges or corners of an object and invariant to translation and rotation, so it is limited in capturing color, gradient, or intensity features; yet these attributes are crucial for identifying objects. Second, dense SIFT can be ambiguous: descriptors extracted from the same image pattern may lie far apart in feature space, while descriptors extracted from different patterns may cluster together. These limitations severely restrict the discriminative power of the image representation if one uses only the SIFT feature as the basic image pattern.
In this paper, we aim at learning a mapping function that maps SIFT features to a high-dimensional feature space in which the basic assumption of machine learning holds, i.e., features extracted from the same pattern aggregate while those from different patterns separate. The mapping function is formulated as a non-convex optimization problem analogous to the auto-encoder. It has three layers of neurons: the first layer takes the SIFT feature as input, the second layer holds the mapped feature, and the output layer reconstructs the input by minimizing the reconstruction error. The main objective is accomplished by imposing constraints on the hidden layer via other low-level image features, e.g., HOG.
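The description above suggests an objective of the following general form; this is a sketch written from the stated goals, with the trade-off weights \lambda_1 and \lambda_2 and the exact scatter terms being our assumptions rather than the paper's final formulation:

\min_{W,b} \sum_i \| x_i - g(f(x_i)) \|_2^2
  + \lambda_1 \sum_{y_i = y_j} \| f(x_i) - f(x_j) \|_2^2
  - \lambda_2 \sum_{y_i \neq y_j} \| f(x_i) - f(x_j) \|_2^2,

where f(x) = \sigma(Wx + b) is the hidden-layer mapping applied to a SIFT feature x, g reconstructs the input from the hidden layer, and y_i denotes the class label. The first term is the auto-encoder reconstruction error, the second pulls same-class features together, and the third pushes different-class features apart, matching the minimize-intra-class, maximize-inter-class criterion stated in the abstract.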