multi-class scene classification methods.
III. A SURVEY ON REMOTE SENSING IMAGE SCENE CLASSIFICATION METHODS
During the last few decades, considerable efforts have been made
to develop various methods for the task of scene classification
using satellite or aerial images. As scene classification is usually
carried out in feature space, effective feature representation
plays an important role in constructing high-performance scene
classification methods. We can generally divide existing scene
classification methods into three main categories according to
the features they use: handcrafted feature based methods,
unsupervised feature learning based methods, and deep feature
learning based methods. It should be noted that these three
categories are not mutually exclusive, and the same method may
belong to more than one category.
A. Handcrafted Feature Based Methods
Early works on scene classification are mainly based on
handcrafted features [22, 23, 27, 38, 44, 51, 56, 62, 80, 82,
99-103]. These methods rely on considerable engineering skill
and domain expertise to design various human-engineered
features, such as color, texture, shape, spatial, and spectral
information, or their combinations, which capture the primary
characteristics of a scene image and hence carry information
useful for scene classification. Here, we briefly review the most
representative handcrafted features, including color histograms
[99], texture descriptors [104-106], GIST [107], the
scale-invariant feature transform (SIFT) [108], and histograms
of oriented gradients (HOG) [109].
1) Color histograms: Among all handcrafted features, the
global color histogram [99] is one of the simplest, yet it is an
effective visual feature commonly used in image retrieval and
scene classification [38, 56, 80, 82, 99]. A major advantage of
color histograms, apart from being easy to compute, is that they
are invariant to translation and to rotation about the viewing
axis. However, color histograms cannot convey spatial
information, so it is very difficult to distinguish images that
share the same colors but differ in their spatial color
distributions. In addition, color histograms are sensitive to small
illumination changes and to quantization errors.
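To make this concrete, the following minimal sketch computes a global joint RGB histogram with NumPy; the bin count, quantization scheme, and L1 normalization are illustrative choices rather than details prescribed by [99].

```python
import numpy as np

def color_histogram(image, bins=8):
    """Global color histogram: quantize each RGB channel into `bins`
    levels and count joint occurrences over all pixels. `image` is an
    (H, W, 3) uint8 array; the result is a flat, L1-normalized vector
    of length bins**3."""
    quantized = (image.astype(np.int32) * bins) // 256  # per-channel bin index in [0, bins)
    codes = (quantized[..., 0] * bins + quantized[..., 1]) * bins + quantized[..., 2]
    hist = np.bincount(codes.ravel(), minlength=bins ** 3).astype(np.float64)
    return hist / hist.sum()
```

Because the histogram is a pure per-pixel statistic, shuffling the pixels of an image leaves the descriptor unchanged, which illustrates both its invariance properties and its blindness to spatial layout.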
2) Texture descriptors: Texture features, such as the grey-level
co-occurrence matrix (GLCM) [104], Gabor features [105], and
local binary patterns (LBP) [84, 106, 110], are widely used
for analyzing aerial or satellite images [51, 56, 62, 100-102].
Texture features are commonly computed by placing primitives
in local image subregions and analyzing their relative
differences, so they are particularly useful for identifying
textural scene images.
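As a small illustration, the sketch below builds an LBP texture histogram with scikit-image; the neighborhood size, radius, and use of "uniform" patterns are common defaults, not values fixed by the works cited above.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray, p=8, r=1.0):
    """Texture descriptor in the spirit of LBP: each pixel of the 2-D
    grayscale array `gray` is coded by thresholding its `p` circular
    neighbors at radius `r` against the center pixel, and the image is
    summarized by the normalized histogram of the resulting 'uniform'
    pattern codes (p + 2 bins)."""
    codes = local_binary_pattern(gray, P=p, R=r, method="uniform")
    hist, _ = np.histogram(codes, bins=np.arange(p + 3), density=True)
    return hist
```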
3) GIST: The GIST descriptor was initially proposed in [107];
it provides a global description of the spatial structure of the
dominant scales and orientations of a scene. It is based on
computing statistics of the outputs of local feature detectors
over spatially distributed subregions. Specifically, in standard
GIST, the image is first convolved with a bank of steerable
pyramid filters. The image is then divided into a 4×4 grid, from
which orientation histograms are extracted. Note that the GIST
descriptor is similar in spirit to the local SIFT descriptor [108].
Owing to its simplicity and efficiency, GIST is popularly used
for scene representation [111-113].
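The following simplified sketch conveys the idea; it substitutes a small Gabor filter bank for the steerable pyramid (a common approximation) and averages filter energy over a 4×4 grid, so it should be read as illustrative rather than as the exact pipeline of [107].

```python
import numpy as np
from skimage.filters import gabor

def gist_like(gray, frequencies=(0.1, 0.2, 0.3), n_orient=4, grid=4):
    """GIST-style global descriptor: filter the 2-D grayscale image
    with a Gabor bank over several frequencies and orientations, then
    average each filter's energy within a grid x grid spatial layout.
    Output length: len(frequencies) * n_orient * grid * grid."""
    h, w = gray.shape
    ys = np.linspace(0, h, grid + 1, dtype=int)
    xs = np.linspace(0, w, grid + 1, dtype=int)
    feats = []
    for f in frequencies:
        for k in range(n_orient):
            real, imag = gabor(gray, frequency=f, theta=k * np.pi / n_orient)
            energy = np.hypot(real, imag)  # magnitude of the complex response
            for i in range(grid):
                for j in range(grid):
                    feats.append(energy[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean())
    return np.asarray(feats)
```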
4) SIFT: The SIFT feature [108] describes subregions by
gradient information around identified keypoints. Standard
SIFT, also known as sparse SIFT, combines keypoint detection
with histogram-based gradient representation. It generally
involves four steps, namely scale-space extrema search,
sub-pixel keypoint refinement, dominant orientation
assignment, and feature description. Apart from the sparse SIFT
descriptor, there is also dense SIFT, which is computed over
uniformly and densely sampled local regions, as well as several
extensions such as PCA-SIFT [114] and speeded-up robust
features (SURF) [115]. The SIFT feature and its variants are
highly distinctive and invariant to changes in scale, rotation,
and illumination.
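As a hedged illustration, both sparse and dense SIFT descriptors can be extracted with OpenCV (version 4.4 or later, where SIFT is part of the main package); the file path and grid stride below are illustrative.

```python
import cv2

# Sparse SIFT: detect keypoints and describe them.
gray = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)  # illustrative path
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)  # descriptors: (N, 128) float32

# Dense SIFT: skip detection and describe a uniform grid of keypoints.
step = 8  # grid stride in pixels, an illustrative choice
grid = [cv2.KeyPoint(float(x), float(y), float(step))
        for y in range(step // 2, gray.shape[0], step)
        for x in range(step // 2, gray.shape[1], step)]
_, dense_descriptors = sift.compute(gray, grid)
```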
5) HOG: The HOG feature was first proposed in [109] to
represent objects by computing the distribution of gradient
intensities and orientations over spatially distributed subregions,
and it has been acknowledged as one of the best features for
capturing the edge and local shape information of objects. It has
shown great success in many scene classification methods [22,
23, 27, 44, 103, 116, 117]. In addition, several extensions have
been developed to further improve the descriptive ability of
HOG for remote sensing images [118-121].
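For illustration, a HOG descriptor can be computed with scikit-image as sketched below; the cell size, block size, and nine orientation bins are the familiar Dalal-Triggs defaults rather than values mandated by [109].

```python
from skimage.feature import hog

def hog_descriptor(gray):
    """HOG in the spirit of [109]: gradient orientation histograms over
    8x8-pixel cells, contrast-normalized in overlapping 2x2-cell blocks,
    concatenated into one feature vector for the 2-D grayscale image."""
    return hog(gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")
```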
These human-engineered features have their own advantages
and disadvantages [56, 90, 101, 102]. In brief, color histograms,
texture descriptors, and the GIST feature are global features that
describe the overall statistical properties of an entire scene
image in terms of particular spatial cues such as color [56, 99],
texture [104-106], or spatial structure [107], so they can be fed
directly to classifiers for scene classification. In contrast, the
SIFT descriptor and the HOG feature are local features that
represent local structure [108] and shape information [109],
respectively. To represent an entire scene image, they are
generally used as building blocks for constructing global image
features, such as the well-known bag-of-visual-words (BoVW)
models [6, 8, 9, 14, 19, 29, 36, 38, 39, 55, 93, 101, 122, 123]
and HOG feature-based part models [22, 23, 27, 103]. In
addition, a number of improved feature encoding/pooling
methods have been proposed in the past few years, such as
Fisher vector coding [10, 14, 84, 86], spatial pyramid matching
(SPM) [124], and probabilistic topic models (PTM) [11, 40, 42,
43, 92, 123].
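To illustrate how local descriptors are aggregated into a global representation, the sketch below implements a minimal BoVW pipeline with scikit-learn: k-means learns a visual vocabulary from the pooled local descriptors (e.g., SIFT), and each image becomes a normalized histogram of visual-word assignments. The vocabulary size is an illustrative choice, not a value prescribed by the cited works.

```python
import numpy as np
from sklearn.cluster import KMeans

def bovw_histograms(descriptor_sets, n_words=256, seed=0):
    """Minimal bag-of-visual-words pipeline. `descriptor_sets` is a
    list with one (N_i, D) array of local descriptors per image; the
    output stacks one L1-normalized n_words-bin histogram per image."""
    vocab = KMeans(n_clusters=n_words, random_state=seed, n_init=10)
    vocab.fit(np.vstack(descriptor_sets))  # learn the visual vocabulary
    hists = []
    for desc in descriptor_sets:
        words = vocab.predict(desc)  # assign each descriptor to a word
        hist = np.bincount(words, minlength=n_words).astype(np.float64)
        hists.append(hist / max(hist.sum(), 1.0))
    return np.vstack(hists)
```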
In real-world applications, scene information is usually
conveyed by multiple cues, including spectral, color, texture,
and shape cues. Each individual cue captures only one aspect of
the scene, so a single type of feature is often inadequate to
represent the content of an entire scene image. Accordingly,
combining multiple complementary features for scene
classification [8, 9, 11, 12, 20, 30, 33, 85, 88, 89, 92, 125] is
considered a promising strategy for improving performance.
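As a toy illustration of this strategy (and not of any specific method cited in this subsection), complementary feature vectors can be normalized and concatenated before being passed to a classifier:

```python
import numpy as np

def fuse_features(*feature_vectors):
    """Naive feature-level fusion: L2-normalize each complementary
    feature vector (e.g., color, texture, and BoVW histograms) and
    concatenate them, so that no single cue dominates the classifier
    merely because of its scale."""
    normed = [f / (np.linalg.norm(f) + 1e-12) for f in feature_vectors]
    return np.concatenate(normed)
```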
For example, Zhao et al. [11] presented a Dirichlet-derived
multiple topic model to combine three types of features at a
topic level for scene classification. Zhu et al. [8] proposed a