Information Fusion 73 (2021) 22–71
X. Jiang et al.
problem, including (1) information theoretic similarity measurement;
(2) reduction of the multimodal problem into a monomodal problem
(modality unification); (3) interest point extraction and matching-based
pipeline. All these strategies can be implemented with deep convolutional
networks, which are the focus of a dedicated subsection.
3.1.1. Information theory-based
Over the past decades, information-theoretic similarity measures have
successfully bridged the appearance gap between multimodal image pairs in
the registration task, and they have been widely investigated and extended
into more advanced forms. This line of work builds on the successful use
of MI, introduced and popularized by Viola and Wells [46,47], and
Collignon and Maes [348,349]. Maes et al. [347] recognized that the MI
measure gave rise to revolutionary breakthroughs in the MMIR task.
However, the widespread use and study of MI have revealed some of its
shortcomings. Most notably, it is not overlap-invariant: MI may be
maximized in certain cases even when the images are misaligned.
Following the pipeline of maximizing the MI score for MMIR,
numerous advanced information theoretic approaches have been in-
vestigated to remedy the abovementioned shortcoming. For example,
Studholme et al. [48] proposed a normalized version of MI, namely,
NMI, to better register slices through clinical MR and CT image volumes
of the brain. An upper bound on the maximum attainable MI [49] has been
studied for deformable image registration, providing further insight into
the use of MI as a similarity metric. In addition, cMI [50] was proposed
as an improved similarity metric for nonrigid registration: it is computed
from a 3D joint histogram over both intensity and spatial dimensions, and
is incorporated into a tensor-product B-spline nonrigid registration
method using either a Parzen window or a generalized partial-volume kernel
for histogram construction. In [350], the authors proposed a hybrid
strategy that combines spatial information with MI to achieve multimodal
retinal image registration.
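To make the metrics above concrete, both MI and Studholme's NMI can be estimated from a joint intensity histogram. The following is a minimal NumPy sketch; the bin count and the simple histogram estimator are illustrative choices, not those of the cited works:

```python
import numpy as np

def _entropies(a, b, bins=32):
    """Marginal and joint entropies from a joint intensity histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()                 # joint distribution p(x, y)
    px, py = pxy.sum(axis=1), pxy.sum(axis=0) # marginals p(x), p(y)
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return h(px), h(py), h(pxy)

def mutual_information(a, b, bins=32):
    """MI = H(X) + H(Y) - H(X, Y)."""
    hx, hy, hxy = _entropies(a, b, bins)
    return hx + hy - hxy

def normalized_mi(a, b, bins=32):
    """Studholme's NMI = (H(X) + H(Y)) / H(X, Y); less overlap-sensitive."""
    hx, hy, hxy = _entropies(a, b, bins)
    return (hx + hy) / hxy
```

For perfectly aligned identical images the joint histogram is diagonal, so NMI reaches its maximum of 2; independent images score near the minimum.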
Many researchers have used divergence measures to compare the joint
intensity distributions in MMIR, including the Kullback–Leibler divergence
(KLD) [52,53] and the Jensen–Shannon divergence (JSD) [54]. The use of
Rényi entropy [351,352] has also attracted great attention in the
registration problem, where it is computed with minimum spanning trees or
spanning graphs [353], or integrated with KLD [354] for better
generalization.
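As a small illustration of why JSD is attractive here: unlike KLD, it is symmetric and bounded by log 2. A sketch over discrete distributions (the inputs are assumed to be pre-binned, e.g. flattened joint histograms):

```python
import numpy as np

def jensen_shannon_divergence(p, q):
    """Symmetric, bounded divergence: JSD = (KLD(p||m) + KLD(q||m)) / 2,
    with m the average distribution."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    def kld(x, y):
        mask = x > 0          # 0 * log(0) terms contribute nothing
        return np.sum(x[mask] * np.log(x[mask] / y[mask]))
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)
```

JSD vanishes only for identical distributions and never exceeds log 2 (in nats), which makes it a well-behaved matching cost.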
Because these statistical measures are commonly based on a single-pixel
joint distribution model, statistical criteria evaluated over global or
local regions have also been explored. For example, building on a linearly
weighted sum of local MI evaluations, Studholme et al. [51] proposed RMI
to reduce the errors caused by local intensity changes. Others used
octrees [355] or locally distributed functions [356].
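The regional idea can be sketched as an average of MI scores over local windows. The toy version below uses a fixed window size, stride, and uniform weighting, which are deliberate simplifications of the weighted formulation in [51]:

```python
import numpy as np

def regional_mi(a, b, win=16, step=16, bins=16):
    """Average MI over local windows (a simplified regional MI)."""
    def mi(x, y):
        joint, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=bins)
        pxy = joint / joint.sum()
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)
        h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
        return h(px) + h(py) - h(pxy)
    scores = [mi(a[i:i + win, j:j + win], b[i:i + win, j:j + win])
              for i in range(0, a.shape[0] - win + 1, step)
              for j in range(0, a.shape[1] - win + 1, step)]
    return float(np.mean(scores))
```

Averaging local scores keeps the metric sensitive to spatially varying intensity relationships that a single global histogram would blur together.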
Many researchers have been paying increasing attention to opti-
mization methods to quickly and accurately estimate transformation
models. Wachowiak et al. [83] observed that local optimization techniques
frequently fail because, in an area-based procedure, the metric functions
are generally nonconvex and irregular with respect to the transformation
parameters. Hence, they adapted an evolutionary approach based on particle
swarm optimization for biomedical MMIR. Arce et al. [357] used MRF
coefficients under a Bayesian formulation to model local polynomial
intensity transformations, while local geometric transformations were
modeled as MRF prior information, to register both rigid and nonrigid
brain images of MRI T1 and T2 modalities. Moreover, Freiman et al. [358]
presented a new nonuniform sampling method for the accurate estimation of
MI in rigid multimodal brain image registration. This method uses the 3D
fast discrete curvelet transform to reduce the sampled voxels'
interdependency by sampling voxels that are less dependent on their
neighborhoods, thus providing a more accurate estimate of the MI.
Following the NMI and FFD registration pipeline, Yang et al. [359]
introduced a cooperative coevolving optimization method that combines
limited-memory Broyden–Fletcher–Goldfarb–Shanno with boundaries
(L-BFGS-B) and cat swarm optimization for nonrigid MMIR. In this method,
a block grouping strategy captures the interdependency of all variables,
achieving fast convergence and better registration accuracy on 3D CT,
PET, and T1-, T2-, and PD-weighted MR images.
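A common safeguard against the local optima discussed above is to precede any local refinement with a coarse exhaustive search over a restricted transformation space. The toy translation-only sketch below uses circular shifts and a squared-intensity remapping to stand in for real motion and a real modality change; the winning shift would then seed a local optimizer such as L-BFGS-B or Powell:

```python
import numpy as np

def neg_mi(a, b, bins=32):
    """Negative MI from a joint histogram (the cost to minimize)."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return -(h(px) + h(py) - h(pxy))

def coarse_register(fixed, moving, search=5):
    """Exhaustive search over integer 2D shifts; immune to local optima
    within the search window, unlike gradient-based refinement."""
    shifts = [(dy, dx) for dy in range(-search, search + 1)
                       for dx in range(-search, search + 1)]
    return min(shifts,
               key=lambda s: neg_mi(fixed, np.roll(moving, s, axis=(0, 1))))
```

Because MI is invariant to the intensity remapping, the search recovers the geometric offset even though the two "modalities" look nothing alike.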
More recently, taking spatial information into account, Legg et al. [360]
proposed feature neighborhood MI in a two-stage nonrigid registration
framework to align paired retinal fundus photographs and confocal scanning
laser ophthalmoscope (CSLO) images. This improved MI outperforms many
existing MI variants, including the original MI, gradient MI,
gradient-image MI, second-order MI, regional MI, feature MI, and
neighborhood-incorporated MI.
The methods explored in the early part of this decade were com-
prehensively reviewed in [29]; readers may refer to this work for more
details. Similarity metrics that use deep learning are reviewed in the
part on learning-based methods.
3.1.2. Modality unification-based
Another strategy aims to transform the two modalities into a common
domain, making applicable the general similarity metrics that have proven
successful in monomodal image matching. There are two ways to reduce the
problem to a monomodal one: simulating one modality from the other, and
mapping both modalities into a third one. In this part, we review typical
handcrafted approaches that follow this idea. Approaches based on deep
networks, such as style transfer learning and descriptor learning, are
introduced with the learning-based methods; refer to the corresponding
part for more details.
Following this strategy, several studies aim to map one modality to
another according to the physical properties of the imaging device. To
register US and MR images, Roche et al. [361] transformed MR images into
the domain of US images on the basis of their intensities and gradient
information; registration is then performed with a rigid model based on an
extended correlation ratio method. Another method [362] generates
pseudo-US images from CT by exploiting the physical principles of US
imaging, achieving CT-US rigid/affine registration optimized under a
locally evaluated statistical criterion. In addition to mapping one
modality to another globally, a local patch-based strategy has also been
studied: areas where a direct MI metric is unreliable are first
identified, and small patches from these areas are then simulated into a
common domain [363]. The mapping strategy has also been implemented with
learning, which will be reviewed in the part on learning-based methods.
Another way to map two different image modalities into a common one is to
exploit morphological information, such as edge or contour structures,
which commonly exist in both modalities. Many approaches directly extract
this morphological or structural information through filtering [364,365]
or existing edge extractors. Gabor filtering readily captures texture
information from raw images, which is why it is widely used for modality
unification [364,366], as are several local frequency representations
[365].
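As a sketch of this unification step, an image from either modality can be mapped to a stack of rectified Gabor responses, on which a monomodal metric can then operate. The filter-bank parameters below (frequencies, orientations, envelope width) are illustrative, not taken from the cited works:

```python
import numpy as np
from scipy import ndimage

def gabor_kernel(freq, theta, sigma, size=15):
    """Real Gabor kernel: a Gaussian envelope modulating an oriented cosine."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)      # rotated coordinate
    env = np.exp(-(x**2 + y**2) / (2 * sigma**2))   # Gaussian envelope
    return env * np.cos(2 * np.pi * freq * xr)

def gabor_energy(img, freqs=(0.1, 0.2),
                 thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Stack rectified responses of a small Gabor filter bank, giving a
    texture representation shared by both modalities."""
    responses = [np.abs(ndimage.convolve(img, gabor_kernel(f, t, sigma=4.0)))
                 for f in freqs for t in thetas]
    return np.stack(responses, axis=0)
```

A filter whose frequency and orientation match the local texture responds strongly regardless of the absolute intensity mapping, which is what makes the representation usable across modalities.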
Local descriptors can also map a target pixel or voxel into a distinctive
vector in a high-dimensional space, making similarity measurement more
convenient and thus the optimization process more effective. Inspired by
this idea, many methods reduce the multimodal problem to a uniform domain
on the basis of self-similarity image representation, first studied as
local self-similarities (LSS) by Shechtman et al. [367]. LSS is a local
feature descriptor that captures the internal geometric layout of local
self-similarities within an image and thereby indirectly represents the
local image property, which is why it can be used to match two textured
regions with significantly different appearance but similar layouts or
geometric shapes. In addition, the modality independent neighborhood
descriptor (MIND) [368] was proposed to extract the distinctive structure
in a local neighborhood and generate description vectors, thus
transforming images of different modalities into a third domain whose
similarity is easily measured by arbitrary metrics such as SSD.
The authors apply this descriptor within a symmetric nonparametric