A Scale-Invariant Framework For Image
Classification With Deep Learning
Yalong Jiang¹, Zheru Chi¹,²
1. Department of Electronic and Information Engineering, the Hong Kong Polytechnic University, Hong Kong SAR, China
2. PolyU Shenzhen Research Institute, Shenzhen, China
yalong.jiang@connect.polyu.hk, chi.zheru@polyu.edu.hk
Abstract—In this paper, we propose a scale-invariant
framework based on Convolutional Neural Networks (CNNs).
The network exhibits robustness to scale and resolution
variations in data. Previous efforts to achieve scale invariance relied either on integrating several variant-specific CNNs or on data augmentation. However, these methods did not solve the fundamental problem that CNNs develop different feature representations for the variants of the same image. The topology proposed in this paper develops a uniform representation for all variants of the same image. This uniformity is achieved by concatenating scale-variant and scale-invariant features to enlarge the feature space, so that input images with diverse variations but from the same class can be distinguished from images of different classes. Higher-order decision boundaries lead to the success of the framework. Experimental results on a challenging dataset substantiate that our framework performs better than traditional frameworks with the same number of free parameters. The proposed framework also achieves higher training efficiency.
Keywords—convolutional neural networks; robustness to scale
variations; scale invariance; higher-order decision boundaries
I. INTRODUCTION
The advantage of Convolutional Neural Networks (CNNs) [1] over traditional machine learning techniques lies in their ability to approximate any function [2]. Trained with stochastic gradient descent, CNNs develop effective representations of the input data, and these representations can meet the challenges of a wide range of tasks [3], [4].
Despite the strong expressiveness of CNNs, their pure reliance on local patterns still hampers performance, especially when input images suffer from variations such as scaling [5], deformations, and translations. These variations can cause misclassifications in critical tasks such as art attribution [6]-[10], because task-relevant clues, such as textures, change as scale varies.
Current state-of-the-art algorithms for dealing with such variations are mostly based on model averaging: as in [1][8][9][15], several different CNNs form an ensemble, with each CNN associated with one scale (a minimal sketch of this scheme is given after the list below). Although these algorithms are effective to some extent, they have the following limitations:
• Model averaging cannot improve the flexibility of CNNs; it still relies on local patterns, which are scale-variant.
• The CNNs in model averaging are independent. For each specific input, only one CNN performs well, while the others may harm the overall performance, because model averaging cannot optimally integrate scale-invariant features with scale-variant features.
• Traditional frameworks only perform well when test images are of the same scale as the training images; they cannot generalize well beyond the training data.
• Training cost is high because several CNNs need to be trained separately.
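For concreteness, the following is a minimal sketch of the scale-specific model-averaging scheme discussed above, not a reproduction of the implementations in [1][8][9][15]; the toy architecture, the chosen scales, and the names SmallCNN and ensemble_predict are illustrative assumptions.

# Minimal sketch (illustrative only) of scale-specific model averaging:
# one CNN per input scale, trained separately, with softmax outputs averaged.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallCNN(nn.Module):
    """A toy single-scale classifier standing in for one ensemble member."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

def ensemble_predict(models, scales, image):
    """Resize the image to each CNN's training scale and average the softmax outputs."""
    probs = []
    for model, s in zip(models, scales):
        resized = F.interpolate(image, size=(s, s), mode='bilinear', align_corners=False)
        probs.append(F.softmax(model(resized), dim=1))
    # Each member votes independently; no sharing of scale-invariant features.
    return torch.stack(probs).mean(dim=0)

# Usage: three CNNs, each tied to one scale (the limitation discussed above).
models = [SmallCNN() for _ in range(3)]
scales = [64, 128, 256]
prediction = ensemble_predict(models, scales, torch.randn(1, 3, 256, 256))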
The cause of the above limitations is that CNNs in traditional frameworks cannot develop complete feature representations covering both scale-invariant and scale-variant features. For example, the CNNs [18] tuned to smaller images tend to rely on scale-invariant features describing contours, while the CNNs tuned to larger images tend to rely on scale-variant features describing textures. Neither type of CNN develops both kinds of features.
This problem cannot be solved by scale jittering: when a CNN that has been trained and tested on a single scale is then trained and tested on two scales [19], its performance drops. This corresponds to the problem in statistical learning that bias increases when a method is not flexible enough to model the complex variations in the data. Only by increasing the model's flexibility can we reduce the bias and achieve robustness to variations.
We propose to solve the problem by concatenating features describing local textures with features describing contours. This increases the model's flexibility, and a robust feature description covering both global and local properties can be generated. The major difference between our framework and current popular algorithms (such as model averaging [15]) is that the proposed framework focuses on both the scale-variant and the scale-invariant features of the input images, so its feature representation is robust to scale variations. In comparison, each CNN in model averaging [15] is over-fitted to one scale variant and cannot handle images of other scales.
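A minimal sketch of this two-branch concatenation idea follows, written under our own simplifying assumptions; the branch architecture, the feature dimension, and the names conv_branch and TwoBranchNet are placeholders rather than the exact design detailed later in this paper.

# Minimal sketch (illustrative only) of two-branch feature concatenation:
# one branch sees the scale-invariant (contour) part of an image, the other
# sees the scale-variant (texture) part; their features are concatenated.
import torch
import torch.nn as nn

def conv_branch(out_dim):
    """A placeholder feature extractor used for both branches."""
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(16, out_dim, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

class TwoBranchNet(nn.Module):
    def __init__(self, num_classes=10, feat_dim=32):
        super().__init__()
        self.invariant_branch = conv_branch(feat_dim)  # e.g. contours
        self.variant_branch = conv_branch(feat_dim)    # e.g. local textures
        # Concatenation enlarges the feature space seen by the classifier.
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, invariant_part, variant_part):
        f_inv = self.invariant_branch(invariant_part)
        f_var = self.variant_branch(variant_part)
        return self.classifier(torch.cat([f_inv, f_var], dim=1))

# Usage with dummy decomposed inputs (the real split comes from the
# preprocessing described in Section II).
net = TwoBranchNet()
logits = net(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128))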
Our framework first decomposes input images into scale-invariant and scale-variant parts using the preprocessing algorithm described in Section II, and then feeds each part to one branch of the framework. Each branch in our