preserving warping (CPW) [31] to align overlapping regions for
small local adjustment while using the homography to maintain
the global image structure. Instead of aligning the pixels of the
overlapping area, Lin et al. [28] proposed to find a local region in
which to stitch the images, which preserves curves and lines during
stitching.
Although traditional image stitching methods have achieved
promising performance, they cannot handle low-texture scenarios.
2.2. Deep image stitching
Deep image stitching is still in development, since labeled data are
hard to collect. In [37,50], synthetic datasets are proposed to address
this problem. In addition, a content revision network is proposed in [37]
to generate the stitched image after image registration.
However, the performance of these methods on real-world datasets is
unreliable, and the resolution of the network input is limited.
2.3. Deep homography schemes
Homography estimation is an important part of image stitching,
and deep homography can also be regarded as a significant step toward
deep image stitching. The deep homography solution was first proposed
in [8], where a synthetic dataset and a VGG-style network were put
forward together. Then, Nguyen et al. [36] proposed an unsupervised
version of [8], in which a photometric loss is adopted to measure the
pixel error between warped images. Le et al. [22] and Zhang et al. [48]
proposed content-aware networks to reject parallax regions and dynamic
areas. Deep Lucas-Kanade networks [3,51] were also presented to align
a template image with a source image. Besides, Koguciuk et al. [20]
proposed to increase robustness using a perceptual loss, and Ye et al.
[45] replaced homography offsets with motion bases to enhance
estimation performance.
Nevertheless, in scenes with low overlap rates, the performance of
these solutions drops because of the limited receptive fields of
convolutional layers.
3. Our method
In this section, we discuss our multi-scale deep homography
module, edge-preserved deformation module, and size-free
schemes, respectively.
3.1. Multi-scale deep homography
Although deep homography methods [8,36,48,22,3] have outperformed
traditional solutions in scenes with high overlap rates, deep
homography estimation in scenes with low overlap rates is still
challenging due to the limited receptive fields of neural networks. To
overcome this challenge, the proposed multi-scale deep homography
network integrates a feature pyramid and feature correlation into one
network, which increase the utilization of feature maps and expand the
receptive field, respectively. The architecture of the proposed
multi-scale deep homography network is shown in Fig. 2.
Feature Pyramid. After the images are fed into our network, they
are processed by 8 convolutional layers, where the number of filters
per layer is set to 64, 64, 128, 128, 256, 256, 512, and 512,
respectively. A max-pooling layer is adopted after every two
convolutional layers, producing multi-scale features $F$, $F_{1/2}$,
$F_{1/4}$, and $F_{1/8}$. As shown in Fig. 2, we select $F_{1/2}$,
$F_{1/4}$, and $F_{1/8}$ to form a three-layer feature pyramid. The
features of each layer in the pyramid are used to estimate the
homography, and we transmit the estimated homography of the upper
layer to the lower layer to enhance the prediction accuracy
progressively. Thus, among the features of the four scales, the
features of three scales are used for subsequent homography
regression, significantly improving feature utilization.
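For concreteness, the pooling hierarchy described above can be sketched as follows. This is a minimal NumPy illustration: the convolutional layers between poolings are omitted, and the input shape is hypothetical.

```python
import numpy as np

def max_pool_2x2(feat):
    """2x2 max-pooling with stride 2 on a (H, W, C) feature map."""
    h, w, c = feat.shape
    feat = feat[: h - h % 2, : w - w % 2]          # crop odd borders
    blocks = feat.reshape(h // 2, 2, w // 2, 2, c)
    return blocks.max(axis=(1, 3))

def build_pyramid(feat):
    """Return features at scales 1, 1/2, 1/4, and 1/8."""
    scales = [feat]
    for _ in range(3):
        scales.append(max_pool_2x2(scales[-1]))
    return scales  # [F, F_1/2, F_1/4, F_1/8]

F = np.random.rand(128, 128, 64)
F1, F2, F4, F8 = build_pyramid(F)
print(F2.shape, F4.shape, F8.shape)  # (64, 64, 64) (32, 32, 64) (16, 16, 64)
```

Only the three coarser scales, $F_{1/2}$, $F_{1/4}$, and $F_{1/8}$, feed the pyramid in Fig. 2.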
Feature Correlation. To increase the receptive field of our network,
the feature correlation layer [38,14,39,18] is used here to strengthen
feature matching explicitly. Formally, the correlation $c$ between the
reference feature $F_A^l \in \mathbb{R}^{W^l \times H^l \times C^l}$
and the target feature
$F_B^l \in \mathbb{R}^{W^l \times H^l \times C^l}$ can be calculated
as,
$$c\left(x_A^l, x_B^l\right) = \frac{\left\langle F_A^l\left(x_A^l\right), F_B^l\left(x_B^l\right)\right\rangle}{\left|F_A^l\left(x_A^l\right)\right|\left|F_B^l\left(x_B^l\right)\right|}, \quad x_A^l, x_B^l \in \mathbb{Z}^2, \tag{1}$$
where $x_A^l$ and $x_B^l$ are 2-D spatial locations in $F_A^l$ and $F_B^l$, respectively.
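Eq. (1) is simply a cosine similarity between two feature vectors. A minimal NumPy sketch (the map shapes and the small eps stabilizer are our own illustrative choices):

```python
import numpy as np

def correlation(feat_a, feat_b, x_a, x_b, eps=1e-8):
    """Normalized correlation between (W, H, C) feature maps feat_a and
    feat_b at integer locations x_a = (i, j) and x_b = (k, l)."""
    va = feat_a[x_a]           # C-dimensional feature vector at x_A
    vb = feat_b[x_b]           # C-dimensional feature vector at x_B
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + eps))

fa = np.random.rand(8, 8, 16)
fb = np.random.rand(8, 8, 16)
print(correlation(fa, fa, (2, 3), (2, 3)))  # ~1.0: identical feature vectors
```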
Specifying the search radius along the width (or height) axis as $R_w$
(or $R_h$), we obtain
$c \in \mathbb{R}^{W^l \times H^l \times (2R_w+1)(2R_h+1)}$ [43] by
Eq. (1). Specifically, we calculate the global correlation by setting
$R_w$ (or $R_h$) equal to $W^l$ (or $H^l$), and we calculate the local
correlation when $R_w$ (or $R_h$) is less than $W^l$ (or $H^l$). By
applying global correlation and local correlation in our network, we
predict the homography progressively from global to local.
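The local correlation volume can be sketched as follows, assuming NumPy and a single shared radius $R = R_w = R_h$; the looping implementation is illustrative, not the paper's actual layer:

```python
import numpy as np

def local_correlation(feat_a, feat_b, radius):
    """Correlate every location of feat_a with its (2R+1)^2 neighbors in
    feat_b, returning a (W, H, (2R+1)^2) cost volume of Eq. (1) values."""
    w, h, c = feat_a.shape
    # normalize feature vectors once so each dot product is a cosine
    na = feat_a / (np.linalg.norm(feat_a, axis=-1, keepdims=True) + 1e-8)
    nb = feat_b / (np.linalg.norm(feat_b, axis=-1, keepdims=True) + 1e-8)
    pad = np.pad(nb, ((radius, radius), (radius, radius), (0, 0)))
    out = np.empty((w, h, (2 * radius + 1) ** 2))
    k = 0
    for dx in range(2 * radius + 1):
        for dy in range(2 * radius + 1):
            shifted = pad[dx:dx + w, dy:dy + h]
            out[..., k] = (na * shifted).sum(axis=-1)
            k += 1
    return out

fa = np.random.rand(8, 8, 16)
fb = np.random.rand(8, 8, 16)
vol = local_correlation(fa, fb, radius=2)
print(vol.shape)  # (8, 8, 25), i.e. (W, H, (2R+1)^2)
```

Setting the radius to the full map size recovers the global variant described above.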
After extracting pyramid features and calculating feature
correlations, we adopt a simple regression network comprising three
convolutional layers and two fully connected layers to predict the
eight vertex offsets (the x and y displacements of the four corners)
of the target image, which uniquely determine a homography. To be more
specific, every layer of our three-layer pyramid predicts residual
offsets $\Delta_i, i = 1, 2, 3$. Every feature correlation in the
pyramid is calculated between the warped target feature and the
reference feature, rather than between the original target feature and
the reference feature. In this way, each layer in the pyramid only
learns to predict the residual homography offsets instead of the
complete offsets, and $\Delta_i$ can be calculated as follows:
$$\Delta_i = \mathcal{H}_{4pt}\left(F_A^{1/2^{4-i}},\ \mathcal{W}\left(F_B^{1/2^{4-i}},\ DLT\left(\sum_{n=0}^{i-1}\Delta_n\right)\right)\right), \tag{2}$$
where $\mathcal{H}_{4pt}$ is the operation of estimating the residual
offsets from the reference feature map and the warped target feature
map, $\mathcal{W}$ warps the target feature map using the homography,
and $DLT$ converts the offsets to the corresponding homography. We
specify $\Delta_0 = 0$, which means the initial predicted offsets are
all zero. The final predicted offsets can be calculated as follows:
$$\Delta_{wh} = \Delta_1 + \Delta_2 + \Delta_3. \tag{3}$$
After that, image registration can be implemented by solving
the homography and warping the input images.
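The $DLT$ step that converts vertex offsets into a homography can be sketched as follows; this is a standard 4-point Direct Linear Transform in NumPy, not necessarily the exact solver used in the paper:

```python
import numpy as np

def dlt_from_offsets(corners, offsets):
    """corners: (4, 2) source vertices; offsets: (4, 2) predicted shifts.
    Solves the 8x8 DLT linear system with h33 fixed to 1."""
    dst = corners + offsets
    A, b = [], []
    for (x, y), (u, v) in zip(corners, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

corners = np.array([[0, 0], [128, 0], [0, 128], [128, 128]], float)
H = dlt_from_offsets(corners, np.zeros((4, 2)))
print(np.round(H, 6))  # identity: zero offsets leave the corners in place
```

Warping the input image with the resulting $3 \times 3$ matrix then completes the registration.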
Objective Function: Our multi-scale deep homography network is trained
in a supervised manner. Given the ground-truth offsets
$\hat{\Delta}_{wh}$, we design the following objective function:
$$\mathcal{L}_H = w_1\left\|\hat{\Delta}_{wh} - \Delta_1\right\| + w_2\left\|\hat{\Delta}_{wh} - \Delta_1 - \Delta_2\right\| + w_3\left\|\hat{\Delta}_{wh} - \Delta_1 - \Delta_2 - \Delta_3\right\|, \tag{4}$$
where $w_1$, $w_2$, and $w_3$ represent the weights of each layer in
the three-layer pyramid.
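Eq. (4) penalizes each pyramid level on its accumulated estimate, not on its individual residual. A minimal NumPy sketch; the Euclidean norm and the unit weights are illustrative assumptions, since both are hyperparameters here:

```python
import numpy as np

def homography_loss(gt, deltas, weights=(1.0, 1.0, 1.0)):
    """gt: (8,) ground-truth offsets; deltas: list of three (8,) residual
    predictions. Level i is penalized on the running sum Δ_1 + ... + Δ_i."""
    loss, acc = 0.0, np.zeros_like(gt)
    for w, d in zip(weights, deltas):
        acc = acc + d                       # accumulated offsets so far
        loss += w * np.linalg.norm(gt - acc)
    return loss

gt = np.ones(8)
deltas = [0.5 * np.ones(8), 0.3 * np.ones(8), 0.2 * np.ones(8)]
# the third term is ~0 here because the residuals sum exactly to gt
print(homography_loss(gt, deltas))
```

This progressive supervision encourages each level to correct whatever error remains after the levels above it.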
3.2. Edge-preserved deformation network
Stitching images with a global homography can easily produce
artifacts in scenes with parallax. To eliminate the ghosting effects,
L. Nie, C. Lin, K. Liao et al.
Neurocomputing 491 (2022) 533–543
535