Image Matching from Handcrafted to Deep Features: A Survey

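"This article is a survey of image matching that traces the evolution from traditional handcrafted features to deep learning methods. The authors are Jiayi Ma, Xingyu Jiang, Aoxiang Fan, Junjun Jiang, and Junchi Yan, and the survey was published in the International Journal of Computer Vision. It examines the large and diverse body of image matching algorithms that has emerged alongside deep learning, and analyzes the open question of how to choose a suitable method for a specific application scenario and task."

Main text: Image matching is a fundamental and critical task in computer vision. It identifies and puts into correspondence the same or similar structures and content across different images. A wide variety of image matching methods have appeared since the last century, and the field has advanced markedly in recent years, driven by deep learning. With so many methods available, researchers continue to ask how to choose a technique for a given scenario and task, and how to design matching methods that are more accurate, more robust, and more efficient.

The article first follows the feature-based image matching pipeline and explains the importance of feature detection. Traditional image matching usually relies on handcrafted features such as SIFT (scale-invariant feature transform), SURF (speeded-up robust features), and ORB (oriented FAST and rotated BRIEF). These features account for scale change, rotation, and illumination, and therefore offer a degree of invariance. They play a key role in matching, but because they are designed by hand they may adapt poorly to new data and can be computationally expensive.

With the rise of deep learning, many methods have appeared that learn image features with neural networks such as CNNs (convolutional neural networks) and RNNs (recurrent neural networks). Deep features, such as those extracted by VGG, ResNet, and DenseNet, learn multi-level image representations automatically, adapt better, and can match more accurately. End-to-end learning frameworks also allow feature detection and matching to be optimized together, further improving overall performance.

Deep learning methods face their own challenges: they need large amounts of labeled training data, the models are complex, and they demand more computing resources. Researchers are therefore exploring ways to reduce computational complexity and memory use while keeping performance high, for example with lightweight network architectures and attention mechanisms.

The article also discusses applications of image matching, including panorama stitching, 3-D reconstruction, and object recognition and tracking. Different applications call for different trade-offs among accuracy, speed, and robustness. In a real-time surveillance system, for example, a fast and robust matching method matters most, whereas a high-precision 3-D reconstruction task may favor a more accurate algorithm.

Finally, the authors summarize current challenges and future research directions. These include improving the generalization of deep features, developing lightweight models for low-resource devices, and combining traditional features with deep features to get the best of both. The survey gives readers a comprehensive view of image matching technology, of the shift from traditional methods to deep learning, and of how to select and tune a matching strategy for a specific need, which makes it a useful reference for researchers and practitioners.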

International Journal of Computer Vision

patch is a core problem in the task of feature description and matching. By correctly identifying the size and orientation, the matching methods can be robust and invariant to global and/or local deformations, such as rotation and scaling. Feature description was originally intended to enhance discrimination compared with direct similarity measurement using raw image information. Numerous well-designed descriptors can improve the discrimination and matching performance by using pooling parameter optimization, sampling rule design, or machine learning and deep learning techniques.

Feature description has drawn increasing attention. Descriptors can be regarded as distinguishable and robust representations for given images and are widely used not only in image matching but also in image coding for image retrieval, face recognition, and other tasks that are based on image similarity measurements. However, direct similarity measurements for two image patches using raw image information will be regarded as an area-based image matching method, which will be reviewed in the next section. As for image patch-based feature descriptors, we will review the traditional ones, i.e., float and binary descriptors, in terms of their data types. A new subsection will be added for the recent data-driven methods, including classical machine learning- and emerging deep learning-based methods. We will comprehensively review handcrafted and learning-based feature description methods and show the connections among these methods to provide useful instructions for the readers toward their further research, especially for developing better description approaches using deep learning/CNN techniques. In addition, we will also review the 3-D feature descriptors, where features are typically obtained from point data without any image pixel information but with spatial position relationships (e.g., 3-D point cloud registration).

3.2 Handcrafted Feature Descriptors

Handcrafted feature descriptors often depend on expert prior knowledge and are still widely used in many visual applications. Following the construction procedure of a traditional local descriptor, the first step is to extract low-level information, which can be broadly classified into image gradient and intensity. Subsequently, the commonly used pooling and normalizing strategies, such as statistic and comparison, are applied to generate long and simple vectors for discriminative description with respect to the data type (float or binary). Therefore, handcrafted descriptors mostly rely on the knowledge of their authors, and description strategies can be classified into gradient statistic-, local binary pattern statistic-, local intensity comparison-, and local intensity order statistic-based methods.

3.2.1 Gradient Statistic-Based Descriptors

Gradient statistic methods are often used to form float-type descriptors such as the histogram of oriented gradients (HOG) (Dalal and Triggs 2005) as introduced in SIFT (Lowe et al. 1999; Lowe 2004) and its improved versions (Bay et al. 2006; Morel and Yu 2009; Dong and Soatto 2015; Tola et al. 2010), and they are still widely used in several modern visual tasks. In SIFT, feature scale and orientation are respectively determined by DoG computation and the largest bin in a histogram of gradient orientation from a local circular region around the detected keypoint, thus achieving scale and rotation invariance. In the description stage, the local region of the detected feature is first divided into a 4 × 4 grid of non-overlapping rectangular cells based on the normalized scale and rotation; then a histogram of gradient orientation with 8 bins is computed in each cell, and the 16 histograms are concatenated into a 128-dimensional (4 × 4 × 8) float vector as the SIFT descriptor.
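The 4 × 4 × 8 layout described above can be illustrated with a minimal NumPy sketch. It is a simplification under stated assumptions, not Lowe's implementation: the patch is assumed to be already scale- and rotation-normalized, and the Gaussian weighting, interpolation between bins, and clipping used by SIFT are omitted. The function name `sift_like_descriptor` is hypothetical.

```python
import numpy as np

def sift_like_descriptor(patch, grid=4, bins=8):
    """Concatenate grid x grid orientation histograms (bins each) into one vector."""
    patch = patch.astype(np.float64)
    gy, gx = np.gradient(patch)                    # derivatives along rows and columns
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % (2 * np.pi)       # orientation in [0, 2*pi)
    # Quantize orientation; the minimum guards against rounding up to `bins`.
    bin_index = np.minimum((angle / (2 * np.pi) * bins).astype(int), bins - 1)

    cell = patch.shape[0] // grid                  # assumes a square patch
    histograms = np.zeros((grid, grid, bins))
    for i in range(grid):
        for j in range(grid):
            rows = slice(i * cell, (i + 1) * cell)
            cols = slice(j * cell, (j + 1) * cell)
            # Magnitude-weighted orientation histogram of one cell.
            histograms[i, j] = np.bincount(
                bin_index[rows, cols].ravel(),
                weights=magnitude[rows, cols].ravel(),
                minlength=bins,
            )
    vector = histograms.ravel()
    norm = np.linalg.norm(vector)
    return vector / norm if norm > 0 else vector

rng = np.random.default_rng(0)
patch = rng.random((16, 16))                       # stand-in for a normalized patch
desc = sift_like_descriptor(patch)
```

With the default arguments the result has 4 × 4 × 8 = 128 entries, which matches the dimensionality stated in the text.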

Another representative descriptor, namely, SURF (Bay et al. 2006), can accelerate the SIFT operator by using the responses of Haar wavelets to approximate gradient computation; integral images are also applied to avoid repeated computation in Haar wavelet responses, enabling more efficient computation than SIFT. Other improvements based on these two typically focus on discrimination, efficiency, robustness, and coping with specific image data or tasks. For instance, CSIFT (Abdel-Hakim and Farag 2006) uses additional color information to enhance the discrimination, and ASIFT (Morel and Yu 2009) simulates all image views obtainable by varying the two camera axis orientation parameters for fully affine invariance. Mikolajczyk and Schmid (2005) use a polar division and histogram statistics of gradient orientations. SIFT-rank (Toews and Wells 2009) has been proposed to investigate ordinal image description based on off-the-shelf SIFT for invariant feature correspondence. A Weber's law-based method (WLD) (Chen et al. 2009) has been studied to compute a histogram by encoding differential excitations and orientations at certain locations.
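The role of integral images mentioned above for SURF can be sketched as follows. This is an illustrative sketch rather than the SURF implementation: once a summed-area table is built, any box sum costs four lookups, so a Haar-like response (difference of two adjacent boxes) has constant cost regardless of window size. The helper names `integral_image`, `box_sum`, and `haar_x` are hypothetical.

```python
import numpy as np

def integral_image(image):
    """Summed-area table with a zero row and column prepended."""
    table = np.zeros((image.shape[0] + 1, image.shape[1] + 1))
    table[1:, 1:] = image.cumsum(axis=0).cumsum(axis=1)
    return table

def box_sum(table, top, left, bottom, right):
    """Sum of image[top:bottom, left:right] from four table lookups."""
    return (table[bottom, right] - table[top, right]
            - table[bottom, left] + table[top, left])

def haar_x(table, row, col, half):
    """Horizontal Haar-like response: right half minus left half of the window."""
    left = box_sum(table, row - half, col - half, row + half, col)
    right = box_sum(table, row - half, col, row + half, col + half)
    return right - left

image = np.arange(64, dtype=np.float64).reshape(8, 8)   # intensity grows by 1 per column
table = integral_image(image)
response = haar_x(table, 4, 4, 2)
```

The table is computed once per image and then reused for every window position and size, which is where the saving over recomputing sums comes from.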

Arandjelović and Zisserman (2012) used a square root (Hellinger) kernel instead of the standard Euclidean distance measurement to transform the original SIFT space to the RootSIFT space and yielded superior performance without increasing processing or storage requirements. Dong and Soatto (2015) modified SIFT by pooling the gradient orientation across different domain sizes and proposed the DSP-SIFT descriptor. Another efficient dense descriptor for wide-baseline stereo based on SIFT, namely, DAISY (Tola et al. 2010), uses a log-polar grid arrangement and Gaussian pooling strategy to approximate the histograms of gradient orientations. Inspired by DAISY, DARTs (Marimon et al. 2010) can efficiently compute scale space and reuse it for descriptors, thus resulting in high efficiency. Several handcrafted float-type descriptors have also been proposed recently and shown promising performance; for example, the pattern of local gravitational force local descriptor (Bhattacharjee and Roy 2019) is inspired by the law of universal gravitation and can be regarded as a combination of force magnitude and angle.
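The RootSIFT idea mentioned in this paragraph can be sketched in a few lines, assuming non-negative SIFT-like descriptors stored as rows of an array. Each descriptor is L1-normalized and then square-rooted element-wise; Euclidean distance between the transformed vectors then corresponds to the Hellinger kernel on the originals. The function name `root_sift` and the random test data are illustrative assumptions.

```python
import numpy as np

def root_sift(descriptors, eps=1e-12):
    """L1-normalize each row, then take the element-wise square root."""
    l1 = np.abs(descriptors).sum(axis=1, keepdims=True)
    return np.sqrt(descriptors / (l1 + eps))

rng = np.random.default_rng(0)
descriptors = rng.random((2, 128))          # stand-ins for two SIFT descriptors
rooted = root_sift(descriptors)

# Hellinger kernel between the two L1-normalized originals.
x = descriptors[0] / descriptors[0].sum()
y = descriptors[1] / descriptors[1].sum()
hellinger = np.sqrt(x * y).sum()

# Squared Euclidean distance in RootSIFT space equals 2 - 2 * Hellinger kernel.
squared_distance = np.sum((rooted[0] - rooted[1]) ** 2)
```

Because the transform is element-wise, the descriptor length and storage stay the same, which is consistent with the claim that no extra processing or storage is needed.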

3.2.2 Local Binary Pattern Statistic-Based Descriptors

Different from SIFT-like approaches, several intensity statistic-based methods, which are inspired by the local binary pattern (LBP) (Ojala et al. 2002), have been proposed in the past decades. LBP has properties that favor its usage in interest region description, such as tolerance against illumination change and computational simplicity. The drawbacks are that the operator produces a rather long histogram and is not very robust in flat image areas. Center-symmetric LBP (CS-LBP) (Heikkilä et al. 2009) (using SVM for classifier training) is a modified version of LBP combining the strengths of SIFT and LBP to address the flat area problem. Specifically, CS-LBP uses a SIFT-like grid and replaces the gradient information with an LBP-based feature. To address the noise, center-symmetric local ternary pattern (CS-LTP) (Gupta et al. 2010) suggests the use of a histogram of relative orders in the patch and a histogram of LBP codes, such as a histogram of relative intensities. The two CS-based methods are designed to be more robust to Gaussian noise than previously considered descriptors. RLBP (Chen et al. 2013) improves the robustness of LBP by changing the coding bit; a completed modeling of the LBP operator and an associated completed LBP scheme (Guo et al. 2010) have been developed for texture classification. LBP-like methods are widely used in the texture representation and face recognition community, and additional details can be found in the review literature (Huang et al. 2011).
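To make the basic LBP operator concrete, here is a minimal sketch of the plain 8-neighbor variant (not CS-LBP or CS-LTP). Each interior pixel gets one bit per neighbor, set when the neighbor is at least as bright as the center, and the 256-bin histogram of the codes serves as the descriptor. The neighbor ordering and the names `lbp_codes` and `lbp_histogram` are assumptions made for illustration.

```python
import numpy as np

def lbp_codes(image):
    """8-neighbor LBP code for every interior pixel."""
    image = np.asarray(image, dtype=np.float64)
    h, w = image.shape
    center = image[1:-1, 1:-1]
    # Neighbor offsets in clockwise order, starting at the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.int64)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = image[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes += (neighbor >= center) * (1 << bit)
    return codes

def lbp_histogram(image):
    """256-bin histogram of LBP codes."""
    return np.bincount(lbp_codes(image).ravel(), minlength=256)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(10, 10))
hist = lbp_histogram(image)
hist_brighter = lbp_histogram(2 * image + 10)   # monotonic illumination change
```

The codes depend only on the sign of intensity differences, so a monotonic brightness change leaves the histogram unchanged. The 256 bins also illustrate why the text calls the plain LBP histogram rather long.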

3.2.3 Local Intensity Comparison-Based Descriptors

Another form of descriptor is based on the comparison of local intensities. Such descriptors are also called binary descriptors, and the core challenge is the selection rule for comparison. Because of their limited distinctiveness, these methods are mostly limited to short-baseline matching. Calonder et al. (2010) proposed the BRIEF descriptor, built by concatenating the results of a binary test of intensities for several random point pairs in an image patch. Rublee et al. (2011) proposed rotated BRIEF combined with oriented FAST corners and selected robust binary tests using a machine learning strategy in their ORB algorithm to alleviate the limitations in rotation and scale change. Leutenegger et al. (2011) developed the BRISK method using a concentric circle sampling strategy with increasing radius. Inspired by the retina structure, Alahi et al. (2012) proposed the FREAK descriptor by comparing image intensities over a retinal sampling pattern for fast computing and matching with low memory cost while remaining robust to scale, rotation, and noise. Handcrafted binary descriptors and classical machine learning techniques are also widely studied, and these shall be introduced in the learning-based subsection.
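The binary-test construction described for BRIEF can be sketched as follows. This is a simplified illustration, not the published method: the sampling pattern is drawn uniformly at random here, and the patch smoothing that BRIEF applies before testing is omitted. The names `brief_descriptor` and `hamming_distance` are hypothetical.

```python
import numpy as np

def brief_descriptor(patch, pairs):
    """One bit per point pair: 1 if the first point is darker than the second."""
    first = patch[pairs[:, 0], pairs[:, 1]]
    second = patch[pairs[:, 2], pairs[:, 3]]
    return (first < second).astype(np.uint8)

def hamming_distance(a, b):
    """Number of differing bits between two binary descriptors."""
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(0)
pairs = rng.integers(0, 32, size=(256, 4))     # assumed random (row, col, row, col) pattern
patch = rng.integers(0, 256, size=(32, 32))
brighter = patch + 20                          # uniform brightness shift
other = rng.integers(0, 256, size=(32, 32))    # unrelated patch

d_patch = brief_descriptor(patch, pairs)
d_brighter = brief_descriptor(brighter, pairs)
d_other = brief_descriptor(other, pairs)
```

The same `pairs` array must be reused for every patch so that bit positions are comparable. Matching then reduces to Hamming distance, which is cheap to compute and is one reason binary descriptors are fast to match.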

3.2.4 Local Intensity Order Statistic-Based Descriptors

Thus far, many methods have been devised using orders of pixel values rather than raw intensities, achieving more promising performance (Tang et al. 2009; Toews and Wells 2009). Pooling by intensity orders is invariant to rotation and monotonic intensity changes and also encodes ordinal information into the descriptor; the intensity order-pooling scheme may enable the descriptors to be rotation-invariant without estimating a reference orientation as SIFT does, and this estimation appears to be a major error source for most existing methods. To solve this problem, Tang et al. proposed the ordinal spatial intensity distribution (Tang et al. 2009) method, which normalizes captured texture information and structure information using an ordinal and spatial intensity histogram; the proposed method is invariant to any monotonically increasing brightness changes.
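The invariance properties claimed for intensity-order pooling can be checked with a small sketch. It only shows the partition step, assuming pixels are grouped into equal-sized bins by the rank of their intensity; it does not reproduce any specific descriptor from the cited works. The function name `order_bins` is hypothetical, and the test patch uses distinct intensities so that ranks are unambiguous.

```python
import numpy as np

def order_bins(patch, n_bins=4):
    """Assign each pixel to a bin according to the rank of its intensity."""
    flat = patch.ravel()
    # argsort of argsort yields the rank of every pixel (0 = darkest).
    ranks = np.argsort(np.argsort(flat, kind="stable"), kind="stable")
    return (ranks * n_bins // flat.size).reshape(patch.shape)

rng = np.random.default_rng(0)
patch = rng.permutation(256).reshape(16, 16)    # distinct intensities 0..255

bins = order_bins(patch)
bins_monotonic = order_bins(3 * patch + 7)      # monotonically increasing change
bins_rotated = order_bins(np.rot90(patch))      # 90-degree rotation of the patch
```

A pixel's bin depends only on how its intensity ranks within the patch, not on where the pixel sits, so the partition follows the patch under rotation and is unchanged by monotonically increasing brightness changes. Features pooled per bin inherit these properties without a reference orientation.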

Fan et al. (2011) pooled local features based on their gradient and intensity orders in multiple support regions and proposed the multi-support region order-based gradient histogram and the multi-support region rotation and intensity monotonic invariant descriptor methods. A similar strategy was used in LIOP (Wang et al. 2011, 2015) to encode the local ordinal information of each pixel. In that work, the overall ordinal information was used to divide the local patch into subregions, which were used to accumulate LIOP. LIOP was further improved into OIOP/MIOP (Wang et al. 2015), which can then encode overall ordinal information for noise and distortion robustness. They also proposed a learning-based quantization to improve its distinctiveness.

3.3 Learning-Based Feature Descriptors

Handcrafted descriptors, as reviewed above, require expertise to design and may disregard useful patterns hidden in the data. This requirement has prompted investigations on learning-based descriptors, which have recently become dominantly popular due to their data-driven property and promising performance. In the following, we will discuss a group of classical learning-based descriptors introduced before the deep learning era.

3.3.1 Classical Learning-Based Descriptors

The learning-based descriptors can be traced back to PCA-SIFT (Ke et al. 2004), in which principal component analysis (PCA) is used to form a robust and compact descriptor by reducing the dimensionality of a vector made of the local image gradients. Cai et al. (2010) investigated the use of linear discriminant projections to reduce dimensionality and improve the discriminability of local descriptors. Brown et al. (2010) introduced a learning framework with a set of building blocks for constructing descriptors by using Powell minimization and the linear discriminant analysis (LDA) technique to find the optimal parameters. Simonyan et al. (2014) presented a novel formulation to represent the spatial pooling and dimensionality reduction in descriptor learning as convex optimization problems based on Brown's work (Brown et al. 2010). Meanwhile, Trzcinski et al. (2012, 2014) applied the boosting trick to learn boosted, complex non-linear local visual feature representations from multiple gradient-based weak learners.
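The PCA step behind PCA-SIFT can be sketched generically: fit principal directions on a set of training vectors, then project new vectors onto the leading components to obtain a compact descriptor. The sizes below (128-dimensional inputs, 20 output dimensions) and the random data are arbitrary choices for illustration, not the settings of Ke et al.; `fit_pca` and `project` are hypothetical names.

```python
import numpy as np

def fit_pca(vectors, n_components):
    """Return the training mean and the top principal directions (as rows)."""
    mean = vectors.mean(axis=0)
    # Rows of vt are orthonormal directions sorted by decreasing variance.
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(vectors, mean, components):
    """Project vectors onto the learned low-dimensional subspace."""
    return (vectors - mean) @ components.T

rng = np.random.default_rng(0)
training = rng.random((500, 128))     # stand-ins for local gradient vectors
mean, components = fit_pca(training, n_components=20)
compact = project(training[:5], mean, components)
```

The projection is learned once offline; at matching time each descriptor costs a single matrix-vector product.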

Apart from the above-mentioned float-valued descriptors, binary descriptors are also of great interest in classical descriptor learning due to their beneficial properties, such as low storage requirements and high matching speed. A natural way to obtain binary descriptors is to learn them from the provided float-valued descriptors. This task is conventionally achieved by hashing methods, which aim to learn compact representations of high-dimensional data while maintaining their similarity in the new space. Locality sensitive hashing (LSH) (Gionis et al. 1999) is arguably a popular unsupervised hashing method. This method generates embeddings via random projections and has been used for many large-scale search tasks. Some variants of LSH include kernelized LSH (Kulis and Grauman 2009), spectral hashing (Weiss et al. 2009), semantic hashing (Salakhutdinov and Hinton 2009), and p-stable distribution-based LSH (Datar et al. 2004). These variants are unsupervised by design.
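One common way to hash float descriptors with random projections is to keep only the sign of each projection. The sketch below uses that random-hyperplane scheme as an assumption; it illustrates the general idea and is not claimed to be the exact construction of Gionis et al. The function name `lsh_codes` and the 64-bit code length are hypothetical choices.

```python
import numpy as np

def lsh_codes(vectors, projections):
    """One bit per random hyperplane: which side of it each vector falls on."""
    return (vectors @ projections > 0).astype(np.uint8)

rng = np.random.default_rng(0)
n_bits = 64
projections = rng.standard_normal((128, n_bits))   # random hyperplane normals

v = rng.standard_normal(128)                       # stand-in float descriptor
vectors = np.stack([v, 2.0 * v, -v])
codes = lsh_codes(vectors, projections)

same_direction = int(np.count_nonzero(codes[0] != codes[1]))
opposite_direction = int(np.count_nonzero(codes[0] != codes[2]))
```

Vectors pointing in the same direction receive identical codes and opposite vectors differ in every bit; in between, the expected Hamming distance grows with the angle between the vectors, which is what makes the codes usable for approximate search.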

Supervised hashing methods have also been extensively investigated, where different machine learning strategies have been proposed to learn feature spaces tailored to specific tasks. In this case, a plethora of methods have been proposed (Kulis and Darrell 2009; Wang et al. 2010; Strecha et al. 2012; Liu et al. 2012a; Norouzi and Blei 2011; Gong et al. 2013; Shakhnarovich 2005), among which image matching is considered an important experimental validation task. For example, the LDA technique is utilized in Strecha et al. (2012) to aid hashing. Semi-supervised sequential learning algorithms are proposed in Liu et al. (2012a) and Wang et al. (2010) to find discriminative projections. Minimal loss hashing (Norouzi and Blei 2011) provided a new formulation to learn binary hash functions on the basis of structural SVMs with latent variables. Gong et al. (2012) proposed searching a rotation of zero-centered data to minimize the quantization error of mapping the descriptor to the vertices of a zero-centered binary hypercube.
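The rotation search in the last sentence can be sketched as an alternating minimization, under the assumption that the quantization error is the squared Frobenius distance between the rotated data and its sign pattern. With the rotation fixed, the nearest hypercube vertices are the signs of the rotated data; with the codes fixed, the best rotation is an orthogonal Procrustes solution obtained from an SVD. The function name `itq_rotation`, the iteration count, and the random data are illustrative, and any dimensionality reduction applied before the rotation is omitted.

```python
import numpy as np

def itq_rotation(data, n_iter=20, seed=0):
    """Alternate between binary codes and rotation to reduce quantization error."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    # Start from a random orthogonal matrix.
    rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    errors = []
    for _ in range(n_iter):
        # Fixed rotation: nearest vertices of the {-1, +1} hypercube.
        codes = np.where(data @ rotation >= 0, 1.0, -1.0)
        # Fixed codes: orthogonal Procrustes solution for the rotation.
        u, _, wt = np.linalg.svd(data.T @ codes)
        rotation = u @ wt
        errors.append(np.linalg.norm(codes - data @ rotation) ** 2)
    return rotation, errors

rng = np.random.default_rng(1)
data = rng.standard_normal((200, 8))
data -= data.mean(axis=0)                 # zero-center, as the text requires
rotation, errors = itq_rotation(data)
binary = (data @ rotation >= 0).astype(np.uint8)
```

Each of the two steps can only lower or keep the error, so the recorded `errors` sequence is non-increasing, although the procedure may stop at a local minimum.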

Trzcinski and Lepetit (2012) and Trzcinski et al. (2017) reported that a straightforward way of developing binary descriptors is to directly learn representations from image patches. In Trzcinski and Lepetit (2012), they proposed to project image patches to a discriminant subspace by using a linear combination of a few simple filters and then threshold their coordinates for creating the compact binary descriptor. The success of descriptors (e.g., SIFT) during image matching indicates that non-linear filters, such as gradient response, are more suitable than linear ones. Trzcinski et al. (2017) proposed to learn a hash function of the same form as an AdaBoost strong classifier, i.e., the sign of a linear combination of nonlinear weak learners, for each descriptor bit. This work is more general and powerful than Trzcinski and Lepetit (2012), which is based on simple thresholded linear projections. Trzcinski et al. (2017) proposed to generate binary descriptors that are independently adapted per patch. This objective is achieved by inter- and intra-class online optimization for descriptors.

3.3.2 Deep Learning-Based Descriptors

Descriptors using deep techniques are usually formulated as a supervised learning problem. The objective is to learn a representation that can enable the two matched features to be as close as possible while the unmatched ones are far apart in the measuring space (Schonberger et al. 2017). Descriptor learning is often conducted with cropped local patches centered on the detected keypoints; thus, it is also known as patch matching. In general, existing methods consist of two forms, namely, metric learning (Weinberger and Saul 2009; Zagoruyko and Komodakis 2015; Han et al. 2015; Kedem et al. 2012; Wang et al. 2017) and descriptor learning (Simo-Serra et al. 2015; Balntas et al. 2016a, 2017; Zhang et al. 2017c; Mishchuk et al. 2017; Wei et al. 2018; He et al. 2018; Tian et al. 2019; Luo et al. 2019), according to the output of deep learning-based descriptors. These two forms are often jointly trained. Specifically, metric learning methods often learn a discriminative metric for similarity measurement with raw patches or generated descriptors as inputs. By contrast, descriptor learning tends to generate the descriptor representation from raw images or patches. Such a process requires a measurement method, such as L2 distance or a trained metric network, for similarity evaluation. In contrast with single metric learning, the use of CNNs to generate description vectors is more flexible and may save time by avoiding repeated computation when a large number of candidate patches are available for correspondence search. Deep learning has achieved satisfying performance in feature description due to its strong ability in information extraction and representation.
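The training objective described above, matched descriptors close and unmatched ones far apart under L2 distance, is often expressed with a margin-based triplet loss. The sketch below assumes that particular loss for illustration; the text does not prescribe it, and the descriptors here are plain arrays rather than CNN outputs. The function name `triplet_margin_loss` and the margin value are hypothetical.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Mean hinge loss on (matched distance - unmatched distance + margin)."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)   # L2 distance to the match
    d_neg = np.linalg.norm(anchor - negative, axis=1)   # L2 distance to the non-match
    return np.maximum(0.0, margin + d_pos - d_neg).mean()

anchor = np.array([[0.0, 0.0]])
near = np.array([[0.0, 0.0]])
far = np.array([[3.0, 4.0]])

well_separated = triplet_margin_loss(anchor, near, far)   # match near, non-match far
badly_separated = triplet_margin_loss(anchor, far, near)  # roles swapped
```

The loss is zero once the unmatched descriptor is farther away than the matched one by at least the margin, so training effort concentrates on triplets that still violate the desired ordering.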

Descriptors with deep learning techniques can be regarded as an extension of those based on classical learning (Schonberger et al. 2017). For instance, the Siamese structure in Chopra et al. (2005) and the commonly used loss func-

