Image Matching from Handcrafted to Deep Features: A Comprehensive Survey

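"Ma et al. - 2021 - Image Matching from Handcrafted to Deep Features: A Survey.pdf" is an article published in the International Journal of Computer Vision (2021) by Jiayi Ma, Xingyu Jiang, Aoxiang Fan, Junjun Jiang, and Junchi Yan. It reviews and analyzes in detail the image matching methods proposed over the past decades, especially those that emerged with the development of deep learning. It discusses why image matching is a core task in many vision applications: identifying and then putting into correspondence the same or similar structure or content in two or more images.

As research has progressed, a large and diverse body of image matching methods has appeared. This growth has also left open questions, such as which method to choose for a given application scenario and task requirement, and how to design image matching algorithms with better accuracy, robustness, and efficiency. The authors therefore undertake a comprehensive and systematic review and evaluation of both classical and recent techniques.

The article covers methods ranging from traditional handcrafted features (such as SIFT, SURF, and ORB) to feature representations driven by deep learning, such as features generated by convolutional neural networks (CNNs). Handcrafted methods rely on carefully designed local descriptors that resist image transformations to some degree but may fail under complex illumination, occlusion, and viewpoint changes. Deep learning methods instead learn features automatically from regularities in the data and often perform better in many scenarios.

Deep learning applications in image matching fall into two groups: models trained end to end and models that extract features with a pretrained CNN. The former, such as MatcherNet and PatchNet, learn the matching function directly; the latter, such as DeepMatching and DeepFeatureFlow, use a pretrained CNN (such as VGG or ResNet) to extract high-level semantic features and then match them. These methods usually generalize better in complex environments, but at a relatively high computational cost.

The paper also discusses challenges such as occlusion, illumination change, and large pose variation, and how improved feature representations, learning strategies, and loss functions can make matching more robust. In addition, the authors analyze existing benchmarks such as HPatches, TUD-Brussels, and Oxford5K, which are essential for evaluating how different methods perform under varied conditions.

In summary, the article gives researchers and engineers working on image matching a comprehensive view that helps them understand and compare the strengths and weaknesses of different methods and make informed choices in practice. By examining historical techniques in depth and exploring current deep learning trends, the survey helps drive future innovation in the field.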

32 International Journal of Computer Vision (2021) 129:23–79

patch is core problems in the task of feature description and matching. By correctly identifying the size and orientation, the matching methods can be robust and invariant to global and/or local deformations, such as rotation and scaling. The original aim of feature description is to enhance discrimination relative to direct similarity measurement on raw image information. Numerous well-designed descriptors improve discrimination and matching performance through pooling parameter optimization, sampling rule design, or machine learning and deep learning techniques.

Feature description has drawn increasing attention. Descriptors can be regarded as distinguishable and robust representations of given images and are widely used not only in image matching but also in image coding for image retrieval, face recognition, and other tasks based on image similarity measurements. However, direct similarity measurement between two image patches using raw image information is regarded as an area-based image matching method, which will be reviewed in the next section. As for image patch-based feature descriptors, we will review the traditional ones, i.e., float and binary descriptors, in terms of their data types. A separate subsection is devoted to the recent data-driven methods, including classical machine learning-based and emerging deep learning-based methods. We will comprehensively review handcrafted and learning-based feature description methods and show the connections among them to give readers useful guidance for further research, especially for developing better description approaches using deep learning/CNN techniques. In addition, we will review 3-D feature descriptors, where features are typically obtained from point data without any image pixel information but with spatial position relationships (e.g., 3-D point cloud registration).

3.2 Handcrafted Feature Descriptors

Handcrafted feature descriptors often depend on expert prior knowledge and are still widely used in many visual applications. In the construction procedure of a traditional local descriptor, the first step is to extract low-level information, which can be briefly classified into image gradient and intensity. Subsequently, commonly used pooling and normalizing strategies, such as statistics and comparison, are applied to generate long and simple vectors for discriminative description with respect to the data type (float or binary). Therefore, handcrafted descriptors mostly rely on the knowledge of their authors, and description strategies can be classified into gradient statistic-, local binary pattern statistic-, local intensity comparison-, and local intensity order statistic-based methods.

3.2.1 Gradient Statistic-Based Descriptors

Gradient statistic methods are often used to form float-type descriptors such as the histogram of oriented gradients (HOG) (Dalal and Triggs 2005) as introduced in SIFT (Lowe et al. 1999; Lowe 2004) and its improved versions (Bay et al. 2006; Morel and Yu 2009; Dong and Soatto 2015; Tola et al. 2010), and they are still widely used in several modern visual tasks. In SIFT, feature scale and orientation are respectively determined by DoG computation and by the largest bin in a histogram of gradient orientations from a local circular region around the detected keypoint, thus achieving scale and rotation invariance. In the description stage, the local region of the detected feature is first divided into a 4 × 4 grid of non-overlapping rectangular cells based on the normalized scale and rotation; a histogram of gradient orientations with 8 bins is then computed in each cell, and the 16 histograms are concatenated into a 128-dimensional float vector as the SIFT descriptor.
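The description stage outlined above (a 4 × 4 grid of cells, 8 orientation bins per cell, 128 values in total) can be sketched as follows. This is a minimal illustration rather than Lowe's implementation: the function name `sift_like_descriptor`, the fixed 16 × 16 patch size, and the central-difference gradients are assumptions, and the Gaussian weighting, bin interpolation, and clipping used by SIFT are omitted.

```python
import math

def sift_like_descriptor(patch):
    """Build a 4 x 4 x 8 = 128-D gradient-orientation histogram from a 16 x 16 patch.

    `patch` is a list of 16 rows of 16 grey values, assumed to be already
    normalised for scale and rotation (hypothetical input convention).
    """
    size, cells, bins = 16, 4, 8
    cell = size // cells
    hist = [0.0] * (cells * cells * bins)
    for y in range(size):
        for x in range(size):
            # Central differences, clamped at the patch border.
            dx = patch[y][min(x + 1, size - 1)] - patch[y][max(x - 1, 0)]
            dy = patch[min(y + 1, size - 1)][x] - patch[max(y - 1, 0)][x]
            mag = math.hypot(dx, dy)
            if mag == 0.0:
                continue
            # Quantise the gradient direction into one of 8 orientation bins.
            angle = math.atan2(dy, dx) % (2 * math.pi)
            b = int(angle / (2 * math.pi) * bins) % bins
            # Each pixel votes, weighted by gradient magnitude, in its cell.
            idx = ((y // cell) * cells + (x // cell)) * bins + b
            hist[idx] += mag
    # L2-normalise the concatenated histograms.
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist

# A horizontal intensity ramp: every gradient points along +x (bin 0).
patch = [[float(x) for x in range(16)] for _ in range(16)]
desc = sift_like_descriptor(patch)
```

For the ramp patch, all gradient energy falls into orientation bin 0 of each of the 16 cells, which makes the layout of the 128-dimensional vector easy to inspect.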

Another representative descriptor, namely, SURF (Bay et al. 2006), accelerates the SIFT operator by using the responses of Haar wavelets to approximate gradient computation; integral images are also applied to avoid repeated computation of Haar wavelet responses, enabling more efficient computation than SIFT. Other improvements based on these two typically focus on discrimination, efficiency, robustness, and coping with specific image data or tasks. For instance, CSIFT (Abdel-Hakim and Farag 2006) uses additional color information to enhance discrimination, and ASIFT (Morel and Yu 2009) simulates all image views obtainable by varying the two camera axis orientation parameters for full affine invariance. Mikolajczyk and Schmid (2005) use a polar division and histogram statistics of gradient orientations. SIFT-rank (Toews and Wells 2009) has been proposed to investigate ordinal image description based on off-the-shelf SIFT for invariant feature correspondence. A Weber's law-based method (WLD) (Chen et al. 2009) has been studied to compute a histogram by encoding differential excitations and orientations at certain locations.
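To illustrate why integral images avoid repeated computation, the sketch below builds a summed-area table once and then evaluates any box sum, and hence a Haar-like response, with four lookups. It is an assumed, simplified illustration of the idea and not the SURF implementation; the helper names `integral_image`, `box_sum`, and `haar_x` are hypothetical.

```python
def integral_image(img):
    """Return a summed-area table with one extra row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def box_sum(ii, top, left, bottom, right):
    """Sum of the pixels in rows top..bottom-1 and columns left..right-1."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

def haar_x(ii, top, left, size):
    """Horizontal Haar-like response: right half minus left half of a square box."""
    half = size // 2
    right_half = box_sum(ii, top, left + half, top + size, left + size)
    left_half = box_sum(ii, top, left, top + size, left + half)
    return right_half - left_half

# Toy image whose intensity grows from left to right.
img = [[x for x in range(6)] for _ in range(6)]
ii = integral_image(img)
response = haar_x(ii, 0, 0, 4)
```

The cost of `box_sum` does not depend on the box size, which is what makes filter responses at many scales cheap once the table exists.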

Arandjelović and Zisserman (2012) used a square root (Hellinger) kernel instead of the standard Euclidean distance measurement to transform the original SIFT space to the RootSIFT space and yielded superior performance without increasing processing or storage requirements. Dong and Soatto (2015) modified SIFT by pooling the gradient orientations across different domain sizes and proposed the DSP-SIFT descriptor. Another efficient dense descriptor for wide-baseline stereo based on SIFT, namely, DAISY (Tola et al. 2010), uses a log-polar grid arrangement and a Gaussian pooling strategy to approximate the histograms of gradient orientations. Inspired by DAISY, DARTs (Marimon et al. 2010) can efficiently compute the scale space and reuse it for descriptors, thus resulting in high efficiency. Several handcrafted float-type descriptors have also been proposed recently and shown promising performance; for example, the pattern of local gravitational force local descriptor (Bhattacharjee and Roy 2019) is inspired by the law of universal gravitation and can be regarded as a combination of force magnitude and angle.
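The RootSIFT mapping mentioned at the start of this paragraph is commonly described as L1-normalizing the descriptor and taking the element-wise square root, after which the ordinary Euclidean distance corresponds to the Hellinger kernel. The sketch below checks that relationship numerically on toy vectors; the function names and the 4-dimensional inputs are illustrative assumptions (real SIFT descriptors have 128 non-negative entries).

```python
import math

def root_sift(desc, eps=1e-12):
    """L1-normalise a non-negative descriptor, then take the element-wise square root."""
    total = sum(desc) + eps
    return [math.sqrt(v / total) for v in desc]

def hellinger_kernel(a, b, eps=1e-12):
    """Hellinger kernel between two non-negative vectors after L1 normalisation."""
    sa, sb = sum(a) + eps, sum(b) + eps
    return sum(math.sqrt((x / sa) * (y / sb)) for x, y in zip(a, b))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy non-negative "descriptors" (assumed values for illustration).
a = [4.0, 0.0, 1.0, 3.0]
b = [1.0, 2.0, 2.0, 3.0]

# In RootSIFT space: ||sqrt(a) - sqrt(b)||^2 = 2 - 2 * H(a, b).
d2 = squared_euclidean(root_sift(a), root_sift(b))
h = hellinger_kernel(a, b)
```

Because the mapping is applied per descriptor, an existing Euclidean matching pipeline can be reused unchanged, which is consistent with the claim that no extra storage is needed.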

3.2.2 Local Binary Pattern Statistic-Based Descriptors

Different from SIFT-like approaches, several intensity statistic-based methods, which are inspired by the local binary pattern (LBP) (Ojala et al. 2002), have been proposed in the past decades. LBP has properties that favor its usage in interest region description, such as tolerance against illumination change and computational simplicity. Its drawbacks are that the operator produces a rather long histogram and is not very robust in flat image areas. Center-symmetric LBP (CS-LBP) (Heikkilä et al. 2009) (using SVM for classifier training) is a modified version of LBP that combines the strengths of SIFT and LBP to address the flat area problem. Specifically, CS-LBP uses a SIFT-like grid and replaces the gradient information with an LBP-based feature. To address noise, the center-symmetric local ternary pattern (CS-LTP) (Gupta et al. 2010) suggests the use of a histogram of relative orders in the patch and a histogram of LBP codes, such as a histogram of relative intensities. The two CS-based methods are designed to be more robust to Gaussian noise than previously considered descriptors. RLBP (Chen et al. 2013) improves the robustness of LBP by changing the coding bit; a completed modeling of the LBP operator and an associated completed LBP scheme (Guo et al. 2010) have been developed for texture classification. LBP-like methods are widely used in the texture representation and face recognition communities, and additional details can be found in the review literature (Huang et al. 2011).
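As a rough illustration of the operators discussed above, the following sketch computes a basic 8-neighbour LBP code and a 4-bit center-symmetric code for one pixel. It follows the usual textbook formulation and is an assumption, not the exact variants of the cited papers; the neighbour ordering, the `>=` convention, and the default threshold are illustrative choices.

```python
# Offsets of the 8 neighbours of a pixel, clockwise from the top-left.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(img, y, x):
    """8-bit LBP: bit i is set when neighbour i is at least as bright as the centre."""
    centre = img[y][x]
    code = 0
    for i, (dy, dx) in enumerate(NEIGHBOURS):
        if img[y + dy][x + dx] >= centre:
            code |= 1 << i
    return code

def cs_lbp_code(img, y, x, threshold=0):
    """4-bit CS-LBP: compare the 4 centre-symmetric neighbour pairs, ignoring the centre."""
    code = 0
    for i in range(4):
        (dy1, dx1), (dy2, dx2) = NEIGHBOURS[i], NEIGHBOURS[i + 4]
        if img[y + dy1][x + dx1] - img[y + dy2][x + dx2] > threshold:
            code |= 1 << i
    return code

img = [[9, 1, 9],
       [1, 5, 1],
       [9, 1, 2]]

code = lbp_code(img, 1, 1)        # bits 0, 2 and 6 are set
cs_code = cs_lbp_code(img, 1, 1)  # only the (top-left, bottom-right) pair differs
```

A histogram over LBP codes has 256 bins, whereas the center-symmetric variant needs only 16, which relates to the "rather long histogram" drawback noted above. Adding a constant to every pixel leaves both codes unchanged, a simple instance of the illumination tolerance.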

3.2.3 Local Intensity Comparison-Based Descriptors

Another form of descriptor is based on the comparison of local intensities. Such descriptors are also called binary descriptors, and their core challenge is the selection rule for the comparisons. Because of their limited distinctiveness, these methods are mostly limited to short-baseline matching. Calonder et al. (2010) proposed the BRIEF descriptor, built by concatenating the results of binary intensity tests for several random point pairs in an image patch. Rublee et al. (2011) proposed rotated BRIEF combined with oriented FAST corners and selected robust binary tests using a machine learning strategy in their ORB algorithm to alleviate the limitations in rotation and scale change. Leutenegger et al. (2011) developed the BRISK method using a concentric circle sampling strategy with increasing radius. Inspired by the retina structure, Alahi et al. (2012) proposed the FREAK descriptor, which compares image intensities over a retinal sampling pattern for fast computing and matching with low memory cost while remaining robust to scale, rotation, and noise. Handcrafted binary descriptors combined with classical machine learning techniques have also been widely studied, and these are introduced in the learning-based subsection.
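The BRIEF construction described above (binary intensity tests on random point pairs, concatenated into a bit string) can be sketched as below. This is a simplified illustration under stated assumptions: the pairs are drawn uniformly, the pre-smoothing of the patch used in practice is omitted, and the names `make_tests`, `brief_descriptor`, and `hamming` are hypothetical.

```python
import random

def make_tests(n_bits, patch_size, seed=0):
    """Draw n_bits random point pairs inside a patch_size x patch_size patch."""
    rng = random.Random(seed)

    def pick():
        return (rng.randrange(patch_size), rng.randrange(patch_size))

    return [(pick(), pick()) for _ in range(n_bits)]

def brief_descriptor(patch, tests):
    """Pack the binary tests into one int: bit i is 1 when I(p_i) < I(q_i)."""
    bits = 0
    for i, ((y1, x1), (y2, x2)) in enumerate(tests):
        if patch[y1][x1] < patch[y2][x2]:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Number of differing bits between two packed descriptors."""
    return bin(a ^ b).count("1")

tests = make_tests(n_bits=64, patch_size=8)
rng = random.Random(1)
patch = [[rng.randrange(256) for _ in range(8)] for _ in range(8)]
brighter = [[2 * v + 10 for v in row] for row in patch]  # monotonic intensity change

d = brief_descriptor(patch, tests)
d_bright = brief_descriptor(brighter, tests)
```

Because each bit depends only on the order of two intensities, a strictly increasing intensity change leaves the descriptor unchanged, and matching reduces to Hamming distance on packed integers, which is why such descriptors are cheap to store and compare.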

3.2.4 Local Intensity Order Statistic-Based Descriptors

Thus far, many methods have been devised using orders of pixel values rather than raw intensities, achieving more promising performance (Tang et al. 2009; Toews and Wells 2009). Pooling by intensity orders is invariant to rotation and monotonic intensity changes and also encodes ordinal information into the descriptor; the intensity order-pooling scheme may enable descriptors to be rotation-invariant without estimating a reference orientation as SIFT does, an estimation that appears to be a major error source for most existing methods. To solve this problem, Tang et al. proposed the ordinal spatial intensity distribution (Tang et al. 2009) method, which normalizes captured texture information and structure information using an ordinal and spatial intensity histogram; the proposed method is invariant to any monotonically increasing brightness changes.
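The invariance claimed above follows from the fact that ranks, unlike raw values, do not change under a monotonically increasing mapping. The sketch below assigns pixels to ordinal bins by intensity rank and shows that the assignment survives a gamma-like change; it is a generic illustration of intensity-order pooling, not the method of any cited paper, and `ordinal_bins` and the toy pixel values are assumptions.

```python
def ordinal_bins(values, n_bins):
    """Assign each value to one of n_bins groups of (nearly) equal size by intensity rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(values)
    return labels

# Toy patch, flattened to a list of intensities (assumed values).
pixels = [12, 200, 37, 90, 5, 143, 66, 250]
labels = ordinal_bins(pixels, n_bins=4)

# A monotonically increasing brightness change (square-root gamma curve).
gamma = [(v / 255.0) ** 0.5 for v in pixels]
labels_gamma = ordinal_bins(gamma, n_bins=4)
```

A descriptor that pools local features per ordinal bin, instead of per spatial cell, therefore sees the same grouping before and after the brightness change. The grouping also does not depend on where a pixel lies in the patch, so an in-plane rotation only permutes the pixels without changing which bin each one belongs to.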

Fan et al. (2011) pooled local features based on their gradient and intensity orders in multiple support regions and proposed the multi-support region order-based gradient histogram and the multi-support region rotation and intensity monotonic invariant descriptor methods. A similar strategy was used in LIOP (Wang et al. 2011, 2015) to encode the local ordinal information of each pixel. In that work, the overall ordinal information was used to divide the local patch into subregions, which were used to accumulate LIOP. LIOP was further improved into OIOP/MIOP (Wang et al. 2015), which can then encode overall ordinal information for noise and distortion robustness. They also proposed a learning-based quantization to improve its distinctiveness.

3.3 Learning-Based Feature Descriptors

Handcrafted descriptors, as reviewed above, require expertise to design and may disregard useful patterns hidden in the data. This limitation has prompted investigations into learning-based descriptors, which have recently become dominant due to their data-driven nature and promising performance. In the following, we first discuss a group of classical learning-based descriptors introduced before the deep learning era.

3.3.1 Classical Learning-Based Descriptors

The learning-based descriptors can be traced back to PCA-SIFT (Ke et al. 2004), in which principal component analysis (PCA) is used to form a robust and compact descriptor by reducing the dimensionality of a vector made of the local image gradients. Cai et al. (2010) investigated the use of linear discriminant projections to reduce dimensionality and improve the discriminability of local descriptors. Brown et al. (2010) introduced a learning framework with a set of building blocks for constructing descriptors by using Powell minimization and the linear discriminant analysis (LDA) technique to find the optimal parameters. Simonyan et al. (2014) presented a novel formulation to represent the spatial pooling and dimensionality reduction in descriptor learning as convex optimization problems based on Brown's work (Brown et al. 2010). Meanwhile, Trzcinski et al. (2012, 2014) applied the boosting trick to learn boosted, complex non-linear local visual feature representations from multiple gradient-based weak learners.

Apart from the above-mentioned float-valued descriptors, binary descriptors are also of great interest in classical descriptor learning due to their beneficial properties, such as low storage requirements and high matching speed. A natural way to obtain binary descriptors is to learn them from the provided float-valued descriptors. This task is conventionally achieved by hashing methods, which learn compact representations of high-dimensional data while maintaining their similarity in the new space. Locality sensitive hashing (LSH) (Gionis et al. 1999) is arguably a popular unsupervised hashing method. This method generates embeddings via random projections and has been used for many large-scale search tasks. Some variants of LSH include kernelized LSH (Kulis and Grauman 2009), spectral hashing (Weiss et al. 2009), semantic hashing (Salakhutdinov and Hinton 2009), and p-stable distribution-based LSH (Datar et al. 2004). These variants are unsupervised by design.
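One common way to realize hashing by random projections is to keep only the sign of each projection onto a random hyperplane, so that vectors separated by a small angle tend to share most bits. The sketch below shows this sign-random-projection form; it is an assumption chosen for illustration and not necessarily the exact scheme of Gionis et al. (1999), and the names `make_hyperplanes` and `lsh_code` are hypothetical.

```python
import random

def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_code(vec, planes):
    """One bit per hyperplane: 1 if vec lies on the non-negative side, else 0."""
    return tuple(
        1 if sum(w * v for w, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )

def hamming(a, b):
    """Number of positions at which two bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

planes = make_hyperplanes(n_bits=32, dim=4)

# Toy float descriptor (assumed values) and a rescaled copy of it.
v = [0.5, -1.0, 2.0, 0.1]
scaled = [2.0 * x for x in v]

code_v = lsh_code(v, planes)
code_scaled = lsh_code(scaled, planes)
```

The codes depend only on direction, so rescaling a descriptor does not change its bits. No training data is involved in choosing the hyperplanes, which is what makes this family unsupervised.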

Supervised hashing methods have also been extensively investigated, where different machine learning strategies have been proposed to learn feature spaces tailored to specific tasks. In this case, a plethora of methods have been proposed (Kulis and Darrell 2009; Wang et al. 2010; Strecha et al. 2012; Liu et al. 2012a; Norouzi and Blei 2011; Gong et al. 2013; Shakhnarovich 2005), among which image matching is considered an important experimental validation task. For example, the LDA technique is utilized in Strecha et al. (2012) to aid hashing. Semi-supervised sequential learning algorithms are proposed in Liu et al. (2012a) and Wang et al. (2010) to find discriminative projections. Minimal loss hashing (Norouzi and Blei 2011) provided a new formulation to learn binary hash functions on the basis of structural SVMs with latent variables. Gong et al. (2012) proposed searching a rotation of zero-centered data to minimize the quantization error of mapping the descriptor to the vertices of a zero-centered binary hypercube.

Trzcinski and Lepetit (2012) and Trzcinski et al. (2017) reported that a straightforward way of developing binary descriptors is to directly learn representations from image patches. In Trzcinski and Lepetit (2012), they proposed to project image patches to a discriminant subspace by using a linear combination of a few simple filters and then threshold their coordinates for creating the compact binary descriptor. The success of descriptors (e.g., SIFT) during image matching indicates that non-linear filters, such as gradient response, are more suitable than linear ones. Trzcinski et al. (2017) proposed to learn a hash function of the same form as an AdaBoost strong classifier, i.e., the sign of a linear combination of nonlinear weak learners, for each descriptor bit. This work is more general and powerful than Trzcinski and Lepetit (2012), which is based on simple thresholded linear projections. Trzcinski et al. (2017) proposed to generate binary descriptors that are independently adapted per patch. This objective is achieved by inter- and intra-class online optimization for descriptors.

3.3.2 Deep Learning-Based Descriptors

Descriptors using deep techniques are usually formulated as a supervised learning problem. The objective is to learn a representation in which two matched features are as close as possible while unmatched ones are far apart in the measuring space (Schonberger et al. 2017). Descriptor learning is often conducted with cropped local patches centered on the detected keypoints; thus, it is also known as patch matching. In general, existing methods take two forms, namely, metric learning (Weinberger and Saul 2009; Zagoruyko and Komodakis 2015; Han et al. 2015; Kedem et al. 2012; Wang et al. 2017) and descriptor learning (Simo-Serra et al. 2015; Balntas et al. 2016a, 2017; Zhang et al. 2017c; Mishchuk et al. 2017; Wei et al. 2018; He et al. 2018; Tian et al. 2019; Luo et al. 2019), according to the output of deep learning-based descriptors. These two forms are often jointly trained. Specifically, metric learning methods often learn a discriminative metric for similarity measurement with raw patches or generated descriptors as inputs. By contrast, descriptor learning tends to generate the descriptor representation from raw images or patches. Such a process requires a measurement method, such as L2 distance or a trained metric network, for similarity evaluation. In contrast with metric learning alone, the use of CNNs to generate description vectors is more flexible and may save time by avoiding repeated computation when a large number of candidate patches are available for correspondence search. Deep learning has achieved satisfying performance in feature description due to its strong ability in information extraction and representation.
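The objective stated above, pulling matched descriptors together while pushing unmatched ones apart, is often expressed with a margin-based loss. The sketch below evaluates a triplet margin loss on toy 2-D descriptors with L2 distance; the choice of the triplet form, the margin value, and the function names are assumptions for illustration, since the text does not prescribe a specific loss.

```python
import math

def l2(a, b):
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """max(0, margin + d(anchor, positive) - d(anchor, negative))."""
    return max(0.0, margin + l2(anchor, positive) - l2(anchor, negative))

# Toy descriptors (assumed values); in practice these come from a CNN.
anchor = [0.0, 0.0]
positive = [0.0, 0.5]   # descriptor of the matching patch
easy_neg = [3.0, 4.0]   # non-matching patch, already far away
hard_neg = [0.0, 1.0]   # non-matching patch, still close to the anchor

loss_easy = triplet_margin_loss(anchor, positive, easy_neg)
loss_hard = triplet_margin_loss(anchor, positive, hard_neg)
```

Only the hard negative contributes a gradient signal here, since the easy one already satisfies the margin. Note also that with a descriptor network each patch is encoded once and then compared by cheap L2 distances, whereas a pairwise metric network must be evaluated for every candidate pair, which is the repeated computation the paragraph refers to.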

Descriptors with deep learning techniques can be regarded as an extension of those based on classical learning (Schonberger et al. 2017). For instance, the Siamese structure in Chopra et al. (2005) and the commonly used loss func-
