Information Fusion 73 (2021) 22–71
X. Jiang et al.
problem, including (1) information theoretic similarity measurement;
(2) reduction of the multimodal problem into a monomodal problem
(modality unification); (3) interest point extraction and matching-based
pipeline. All these strategies can be implemented with deep convolutional
networks, which are the focus of a dedicated subsection.
3.1.1. Information theory-based
Over the past decades, information-theoretic similarity measures have
successfully bridged the appearance gap between multimodal image pairs in
the registration task, and they have been widely investigated and extended
into more advanced forms. This line of work builds on the successful use
of MI, introduced and popularized by Viola and Wells [46,47], and
Collignon and Maes [348,349]. Maes et al. [347] recognized that the MI
measure gave rise to revolutionary breakthroughs in the MMIR task.
However, the widespread use and study of MI have revealed some of its
shortcomings. Most notably, it is not overlap-invariant: MI may be
maximized in certain cases even when the images are misaligned.
Following the pipeline of maximizing the MI score for MMIR,
numerous advanced information theoretic approaches have been in-
vestigated to remedy the abovementioned shortcoming. For example,
Studholme et al. [48] proposed a normalized version of MI, namely,
NMI, to better register slices through clinical MR and CT image volumes
of the brain. An upper bound on the maximum attainable MI [49] has been
studied for deformable image registration, providing further insight into
the use of MI as a similarity metric. In addition, cMI [50] was proposed
as an improved similarity metric for nonrigid registration: it is computed
from a 3D joint histogram over both intensity and spatial dimensions, and
is incorporated into a tensor-product B-spline nonrigid registration
method using either a Parzen window or a generalized partial-volume kernel
for histogram construction. In [350], the authors proposed a hybrid
strategy that combines spatial information with MI to achieve multimodal
retinal image registration.
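To make the metrics above concrete, both MI and Studholme's NMI can be estimated from a joint intensity histogram. The following is a minimal NumPy sketch; the bin count and the simple histogram estimator are illustrative choices, not those of the cited works:

```python
import numpy as np

def _entropies(a, b, bins=32):
    """Marginal and joint entropies from a joint intensity histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()                 # joint distribution p(x, y)
    px, py = pxy.sum(axis=1), pxy.sum(axis=0) # marginals p(x), p(y)
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return h(px), h(py), h(pxy)

def mutual_information(a, b, bins=32):
    """MI = H(X) + H(Y) - H(X, Y)."""
    hx, hy, hxy = _entropies(a, b, bins)
    return hx + hy - hxy

def normalized_mi(a, b, bins=32):
    """Studholme's NMI = (H(X) + H(Y)) / H(X, Y); less overlap-sensitive."""
    hx, hy, hxy = _entropies(a, b, bins)
    return (hx + hy) / hxy
```

For perfectly aligned identical images the joint histogram is diagonal, so NMI reaches its maximum of 2; independent images score near the minimum.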
Many researchers have used divergence measures to compare the joint
intensity distributions in MMIR, including the Kullback–Leibler divergence
(KLD) [52,53] and the Jensen–Shannon divergence (JSD) [54]. The use of
Rényi entropy [351,352] has also attracted great attention in the
registration problem, where it is computed with minimum spanning trees or
spanning graphs [353], or integrated with KLD [354] for better
generalization.
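As a small illustration of why JSD is attractive here: unlike KLD, it is symmetric and bounded by log 2. A sketch over discrete distributions (the inputs are assumed to be pre-binned, e.g. flattened joint histograms):

```python
import numpy as np

def jensen_shannon_divergence(p, q):
    """Symmetric, bounded divergence: JSD = (KLD(p||m) + KLD(q||m)) / 2,
    with m the average distribution."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    def kld(x, y):
        mask = x > 0          # 0 * log(0) terms contribute nothing
        return np.sum(x[mask] * np.log(x[mask] / y[mask]))
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)
```

JSD vanishes only for identical distributions and never exceeds log 2 (in nats), which makes it a well-behaved matching cost.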
Because these statistical measures are commonly based on a single-pixel
joint distribution model, statistical criteria evaluated over global or
local regions have also been explored. For example, building on a linearly
weighted sum of local MI evaluations, Studholme et al. [51] proposed RMI
to reduce the errors caused by local intensity changes. Others used
octrees [355] or locally distributed functions [356].
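The regional idea can be sketched as an average of MI scores over local windows. The toy version below uses a fixed window size, stride, and uniform weighting, which are deliberate simplifications of the weighted formulation in [51]:

```python
import numpy as np

def regional_mi(a, b, win=16, step=16, bins=16):
    """Average MI over local windows (a simplified regional MI)."""
    def mi(x, y):
        joint, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=bins)
        pxy = joint / joint.sum()
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)
        h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
        return h(px) + h(py) - h(pxy)
    scores = [mi(a[i:i + win, j:j + win], b[i:i + win, j:j + win])
              for i in range(0, a.shape[0] - win + 1, step)
              for j in range(0, a.shape[1] - win + 1, step)]
    return float(np.mean(scores))
```

Averaging local scores keeps the metric sensitive to spatially varying intensity relationships that a single global histogram would blur together.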
Many researchers have been paying increasing attention to opti-
mization methods to quickly and accurately estimate transformation
models. Wachowiak et al. [83] observed that local optimization techniques
frequently fail because, in an area-based procedure, the metric functions
are generally nonconvex and irregular with respect to the transformation
parameters. Hence, they adapted an evolutionary approach based on particle
swarm optimization for biomedical MMIR. Arce et al. [357] used MRF
coefficients under a Bayesian formulation to model local polynomial
intensity transformations, while local geometric transformations were
modeled as MRF prior information, to register both rigid and nonrigid
brain images of MRI T1 and T2 modalities. Moreover, Freiman et al. [358]
presented a new nonuniform sampling method for the accurate estimation of
MI in rigid multimodal brain image registration. This method uses the 3D
fast discrete curvelet transform to reduce the sampled voxels'
interdependency by sampling voxels that are less dependent on their
neighborhoods, thus providing a more accurate estimate of the MI.
Following the NMI and FFD registration pipeline, Yang et al. [359]
introduced a cooperative coevolving optimization method that combines
limited-memory Broyden–Fletcher–Goldfarb–Shanno with boundaries
(L-BFGS-B) and cat swarm optimization for nonrigid MMIR. In this method,
a block grouping strategy captures the interdependency of all variables,
achieving fast convergence and better registration accuracy on 3D CT,
PET, and T1-, T2-, and PD-weighted MR images.
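A common safeguard against the local optima discussed above is to precede any local refinement with a coarse exhaustive search over a restricted transformation space. The toy translation-only sketch below uses circular shifts and a squared-intensity remapping to stand in for real motion and a real modality change; the winning shift would then seed a local optimizer such as L-BFGS-B or Powell:

```python
import numpy as np

def neg_mi(a, b, bins=32):
    """Negative MI from a joint histogram (the cost to minimize)."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return -(h(px) + h(py) - h(pxy))

def coarse_register(fixed, moving, search=5):
    """Exhaustive search over integer 2D shifts; immune to local optima
    within the search window, unlike gradient-based refinement."""
    shifts = [(dy, dx) for dy in range(-search, search + 1)
                       for dx in range(-search, search + 1)]
    return min(shifts,
               key=lambda s: neg_mi(fixed, np.roll(moving, s, axis=(0, 1))))
```

Because MI is invariant to the intensity remapping, the search recovers the geometric offset even though the two "modalities" look nothing alike.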
More recently, taking spatial information into account, Legg et al. [360]
proposed feature neighborhood MI in a two-stage nonrigid registration
framework to align paired retinal fundus photographs and confocal scanning
laser ophthalmoscope (CSLO) images. This improved MI outperforms many
existing MI variants, including the original MI, gradient MI,
gradient-image MI, second-order MI, regional MI, feature MI, and
neighborhood-incorporated MI.
The methods explored in the early part of this decade were com-
prehensively reviewed in [29]; readers may refer to this work for more
details. Similarity metrics that use deep learning are reviewed in the
part on learning-based methods.
3.1.2. Modality unification-based
Another strategy aims to transform the two modalities into a common
domain, making applicable the general similarity metrics that have proven
successful in monomodal image matching. There are two ways to reduce the
problem to a monomodal one: simulating one modality from the other, and
mapping both modalities into a third one. In this part, we review typical
handcrafted approaches that follow this idea. Approaches based on deep
networks, such as style transfer learning and descriptor learning, are
introduced with the learning-based methods; refer to the corresponding
part for more details.
Following this strategy, several studies aim to map one modality to
another according to the physical properties of the imaging device. To
register US and MR images, Roche et al. [361] transformed MR images into
the domain of US images on the basis of their intensities and gradient
information; registration is then performed with a rigid model based on an
extended correlation ratio method. Another method [362] generates
pseudo-US images from CT by exploiting the physical principles of US
imaging, achieving CT-US rigid/affine registration optimized under a
locally evaluated statistical criterion. In addition to mapping one
modality to another globally, a local patch-based strategy has also been
studied: areas where a direct MI metric is unreliable are first
identified, and small patches from these areas are then simulated into a
common domain [363]. The mapping strategy has also been implemented with
learning, which will be reviewed in the part on learning-based methods.
Another way to map two different image modalities into a common one is to
exploit morphological information, such as edge or contour structures,
which commonly exist in both modalities. Many approaches directly extract
this morphological or structural information through filtering [364,365]
or existing edge extractors. Gabor filtering readily captures texture
information from raw images, which is why it is widely used for modality
unification [364,366], as are several local frequency representations
[365].
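As a sketch of this unification step, an image from either modality can be mapped to a stack of rectified Gabor responses, on which a monomodal metric can then operate. The filter-bank parameters below (frequencies, orientations, envelope width) are illustrative, not taken from the cited works:

```python
import numpy as np
from scipy import ndimage

def gabor_kernel(freq, theta, sigma, size=15):
    """Real Gabor kernel: a Gaussian envelope modulating an oriented cosine."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)      # rotated coordinate
    env = np.exp(-(x**2 + y**2) / (2 * sigma**2))   # Gaussian envelope
    return env * np.cos(2 * np.pi * freq * xr)

def gabor_energy(img, freqs=(0.1, 0.2),
                 thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Stack rectified responses of a small Gabor filter bank, giving a
    texture representation shared by both modalities."""
    responses = [np.abs(ndimage.convolve(img, gabor_kernel(f, t, sigma=4.0)))
                 for f in freqs for t in thetas]
    return np.stack(responses, axis=0)
```

A filter whose frequency and orientation match the local texture responds strongly regardless of the absolute intensity mapping, which is what makes the representation usable across modalities.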
Local descriptors can also map a target pixel or voxel into a distinctive
vector in a high-dimensional space, making similarity measurement more
convenient and thus the optimization process more effective. Inspired by
this idea, many methods reduce the multimodal problem to a uniform domain
on the basis of self-similarity image representation, first studied as
local self-similarities (LSS) by Shechtman et al. [367]. LSS is a local
feature descriptor that captures the internal geometric layout of local
self-similarities within an image and thereby indirectly represents the
local image property, which is why it can be used to match two textured
regions with significantly different appearance but similar layouts or
geometric shapes. In addition, the modality independent neighborhood
descriptor (MIND) [368] was proposed to extract the distinctive structure
in a local neighborhood and generate description vectors, thus
transforming images of different modalities into a third domain whose
similarity is easily measured by arbitrary metrics such as SSD.
The authors apply this descriptor within a symmetric nonparametric