International Journal of Computer Vision
patch is a core problem in the task of feature description
and matching. By correctly identifying the size and orien-
tation, the matching methods can be robust and invariant to
global and/or local deformations, such as rotation and scaling.
The original purpose of feature description is to enhance
discrimination relative to direct similarity measurement on
raw image information. Numerous well-designed descriptors
improve discrimination and matching performance through
pooling parameter optimization, sampling rule design, or
machine learning and deep learning techniques.
Feature description has drawn increasing attention. Descrip-
tors can be regarded as distinguishable and robust representa-
tions for given images and are widely used not only in image
matching but also in image coding for image retrieval, face
recognition, and other tasks that are based on image similarity
measurements. However, directly measuring the similarity of
two image patches from raw image information is regarded as
an area-based image matching method, which is reviewed in
the next section. For image patch-based feature descriptors,
we review the traditional ones according to their data types,
i.e., float and binary descriptors. A separate subsection covers
recent data-driven methods, including classical machine
learning- and emerging deep learning-based methods. We
comprehensively review handcrafted and learning-based feature
description methods and show the connections among them to
provide useful guidance for readers' further research,
especially for developing better description approaches using
deep learning/CNN techniques. In addition, we review 3-D
feature descriptors, where
features are typically obtained from point data without any
image pixel information but with spatial position relation-
ships (e.g., 3-D point cloud registration).
3.2 Handcrafted Feature Descriptors
Handcrafted feature descriptors depend on expert prior
knowledge and are still widely used in many visual
applications. Constructing a traditional local descriptor
begins with extracting low-level information, which can be
broadly classified into image gradient and intensity.
Subsequently, commonly used pooling and normalization
strategies, such as statistics and comparison, are applied to
generate long and simple vectors for discriminative description
with respect to the data type (float or binary). Accordingly,
handcrafted descriptors mostly rely on their authors'
knowledge, and description strategies can be classified into
gradient statistic-, local binary pattern statistic-, local
intensity comparison-, and local intensity order
statistic-based methods.
3.2.1 Gradient Statistic-Based Descriptors
Gradient statistic methods are often used to form float-type
descriptors, such as the histogram of oriented gradients
(HOG) (Dalal and Triggs 2005), as introduced in SIFT (Lowe
et al. 1999; Lowe 2004) and its improved versions (Bay
et al. 2006; Morel and Yu 2009; Dong and Soatto 2015; Tola
et al. 2010), and they are still widely used in several modern
visual tasks. In SIFT, feature scale and orientation are
determined, respectively, by DoG computation and by the
largest bin in a histogram of gradient orientations computed
over a local circular region around the detected keypoint, thus
achieving scale and rotation invariance. In the description
stage, the local region of the detected feature is first divided
into a 4 × 4 grid of non-overlapping cells based on the
normalized scale and rotation; a histogram of gradient
orientations with 8 bins is then computed in each cell, and the
16 histograms are concatenated into a 128-dimensional float
vector (4 × 4 × 8 = 128) that forms the SIFT descriptor.
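The description stage outlined above can be illustrated with a minimal, dependency-free Python sketch. It is not Lowe's reference implementation: the function name `sift_like_descriptor` and its parameters are hypothetical, the patch is assumed to be already resampled to the keypoint's scale and rotated to its dominant orientation, and the Gaussian weighting, interpolation between bins, and value clipping of the full method are omitted.

```python
import math

def sift_like_descriptor(patch, grid=4, bins=8):
    """Simplified SIFT-style descriptor of a square patch (list of rows).

    Assumes the patch is already normalized for scale and rotation.
    """
    size = len(patch)
    if size % grid != 0 or any(len(row) != size for row in patch):
        raise ValueError("patch must be square with side divisible by grid")
    cell = size // grid
    hist = [[[0.0] * bins for _ in range(grid)] for _ in range(grid)]

    def pixel(r, c):
        # Clamp coordinates so border pixels reuse their nearest neighbor.
        r = min(max(r, 0), size - 1)
        c = min(max(c, 0), size - 1)
        return float(patch[r][c])

    for r in range(size):
        for c in range(size):
            # Central-difference image gradient.
            gx = (pixel(r, c + 1) - pixel(r, c - 1)) / 2.0
            gy = (pixel(r + 1, c) - pixel(r - 1, c)) / 2.0
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            # Orientation in [0, 2*pi), quantized into `bins` sectors.
            angle = math.atan2(gy, gx) % (2.0 * math.pi)
            b = int(angle / (2.0 * math.pi) * bins) % bins
            # Each pixel votes in its cell's histogram, weighted by magnitude.
            hist[r // cell][c // cell][b] += magnitude

    # Concatenate the grid * grid histograms: 4 * 4 * 8 = 128 values.
    vec = [v for row in hist for cell_hist in row for v in cell_hist]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec
```

For a 16 × 16 patch the function returns a unit-length list of 128 floats. Two such vectors can then be compared with the Euclidean distance, as is standard for float-type descriptors.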
Another representative descriptor, namely, SURF (Bay
et al. 2006), can accelerate the SIFT operator by using the
responses of Haar wavelets to approximate gradient com-
putation; integral images are also applied to avoid repeated
computation in Haar wavelet responses, enabling more effi-
cient computation than SIFT. Other improvements based
on these two typically focus on discrimination, efficiency,
robustness, and coping with specific image data or tasks.
For instance, CSIFT (Abdel-Hakim and Farag 2006) uses
additional color information to enhance the discrimination,
and ASIFT (Morel and Yu 2009) simulates all image views
obtainable by varying the two camera axis orientation param-
eters for fully affine invariance. Mikolajczyk and Schmid
(2005) use a polar division and histogram statistics of gradi-
ent orientations. SIFT-rank (Toews and Wells 2009) has been
proposed to investigate ordinal image description based on
off-the-shelf SIFT for invariant feature correspondence. A
Weber’s law-based method (WLD) (Chen et al. 2009) has
been studied to compute a histogram by encoding differen-
tial excitations and orientations at certain locations.
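The integral-image idea that SURF relies on, mentioned above, can be sketched as follows. This is an illustrative Python example rather than the SURF implementation: the helper names `integral_image`, `box_sum`, and `haar_x` are hypothetical, and the Haar response is the plain "right half minus left half" box difference without SURF's scale-dependent filter sizes or Gaussian weighting.

```python
def integral_image(img):
    """Return an (H+1) x (W+1) table where ii[r][c] is the sum of all
    pixels in rows 0..r-1 and columns 0..c-1 of img."""
    h, w = len(img), len(img[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0.0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of pixels in rows r0..r1-1 and columns c0..c1-1.

    Needs only four table lookups, regardless of the box size.
    """
    return ii[r1][c1] - ii[r0][c1] - ii[r1][c0] + ii[r0][c0]

def haar_x(ii, r, c, size):
    """Horizontal Haar response of a size x size window whose top-left
    corner is (r, c): sum of the right half minus sum of the left half."""
    half = size // 2
    left = box_sum(ii, r, c, r + size, c + half)
    right = box_sum(ii, r, c + half, r + size, c + size)
    return right - left

# Usage: a horizontal intensity ramp has a positive x-response.
image = [[c for c in range(6)] for _ in range(6)]
table = integral_image(image)
response = haar_x(table, 1, 1, 4)
```

Because the table is built once per image, every subsequent box sum costs the same four lookups, which is why repeated Haar wavelet responses at many positions and sizes do not require re-summing pixels.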
Arandjelović and Zisserman (2012) used a square root
(Hellinger) kernel instead of the standard Euclidean dis-
tance measurement to transform the original SIFT space
to the RootSIFT space and yielded superior performance
without increasing processing or storage requirements. Dong
and Soatto (2015) modified SIFT by pooling gradient
orientations across different domain sizes and proposed
the DSP-SIFT descriptor. Another efficient dense descriptor
for wide-baseline stereo based on SIFT, namely, DAISY
(Tola et al. 2010), uses a log-polar grid arrangement and
Gaussian pooling strategy to approximate the histograms of
gradient orientations. Inspired by DAISY, DARTs (Marimon
et al. 2010) efficiently computes the scale space and reuses
it for description, thus achieving high efficiency. Several
handcrafted float-type descriptors have also been proposed