[Figure 2 diagram. Warp stage: $I_r$ / $I_t$ → ResNet50 → $F_r^{1/16}$, $F_t^{1/16}$ → Contextual Correlation Layer → Regression Net 1 → 4-pt motion → DLT → $\mathcal{H}$ (global transformation, initial motion); $F_r^{1/8}$ and warped $F_t^{1/8}$ → Contextual Correlation Layer → Regression Net 2 → residual control point motion + initial motion → TPS solving (Eq. 4) → $\mathcal{TPS}$ (local transformation). Composition stage: warped images $I_{wr}$, $I_{wt}$ → weight-sharing encoders E → decoder D → masks $M_{cr}$, $M_{ct}$ → stitched image (S). Legend: C = composition, w = warp (via $\mathcal{H}$ or via TPS), − = subtraction, + = addition.]
Figure 2: An overview of the proposed parallax-tolerant unsupervised stitching network. Our framework consists of two
stages: warp and composition. The first stage predicts a robust and flexible warp to align images with shape preservation.
The second stage composites the seamless stitched image by generating composition masks corresponding to warped images.
3.1.2 Pipeline of Warp
As shown in Fig. 2, given $I_r$ and $I_t$, we first adopt ResNet50 [17]
with pretrained parameters as our backbone to extract semantic features.
It maps a 3-channel image to high-dimensional semantic features at 1/16 of
the original resolution. The correlation between these feature maps
($F_r^{1/16}$ and $F_t^{1/16}$) is then aggregated into 2-channel feature
flows using the contextual correlation layer [43]. Subsequently, a
regression network estimates the 4-pt parameterization of the homography
warp. This global warp also generates the initial motions of the control
points.
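The step from a 4-pt parameterization to a homography matrix can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `homography_from_4pt`, the corner ordering, the use of `(width, height)` as corner coordinates, and fixing the last matrix entry to 1 are all assumptions made here. It solves the standard direct linear transform (DLT) system for four correspondences with NumPy.

```python
import numpy as np

def homography_from_4pt(offsets, width, height):
    """Solve the 3x3 homography that moves the four image corners by `offsets`.

    offsets: (4, 2) array of (dx, dy) per corner, in the assumed order
             top-left, top-right, bottom-left, bottom-right.
    The last entry of the matrix is fixed to 1 (a common normalisation).
    """
    src = np.array([[0, 0], [width, 0], [0, height], [width, height]], dtype=float)
    dst = src + np.asarray(offsets, dtype=float)
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), and likewise for v,
        # rearranged into two linear equations in the eight unknowns.
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = np.linalg.solve(np.array(A), np.array(b))
    return np.append(h, 1.0).reshape(3, 3)
```

With all four offsets equal, the result reduces to a pure translation, which is a convenient sanity check for the corner ordering.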
Next, we warp the higher-resolution feature maps ($F_t^{1/8}$) to embed
the homographic prior into the following workflow. After another
contextual correlation layer and regression network, the residual motions
of the control points are predicted, contributing to a robust and flexible
TPS warp.
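The combination of initial and residual control-point motions can be sketched as below. This is an illustration under stated assumptions rather than the paper's implementation: the helper names, the evenly spaced $(U+1)\times(V+1)$ control-point layout, and the direction in which the homography is applied are hypothetical choices made for this example. The resulting points would then be passed to a TPS solver, which is not shown.

```python
import numpy as np

def control_points(width, height, U, V):
    """Return a (U+1, V+1, 2) array of evenly spaced (x, y) control points."""
    xs = np.linspace(0.0, width, V + 1)
    ys = np.linspace(0.0, height, U + 1)
    gx, gy = np.meshgrid(xs, ys)          # each has shape (U+1, V+1)
    return np.stack([gx, gy], axis=-1)

def apply_homography(H, pts):
    """Map (..., 2) points through a 3x3 homography."""
    ones = np.ones(pts.shape[:-1] + (1,))
    homog = np.concatenate([pts, ones], axis=-1) @ H.T
    return homog[..., :2] / homog[..., 2:3]

def target_control_points(H, residual, width, height, U, V):
    """Source control points moved by the global warp plus the residual."""
    src = control_points(width, height, U, V)
    initial_motion = apply_homography(H, src) - src   # from the global warp
    return src + initial_motion + residual            # refined locally
```

When the residual is zero, the control points follow the homography exactly, so the TPS warp can only add local flexibility on top of the global alignment.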
3.1.3 Optimization of Warp
To achieve content alignment and shape preservation simultaneously, we
design our objective function $\mathcal{L}^{w}$ around two aspects:
alignment and distortion.

$$\mathcal{L}^{w} = \mathcal{L}^{w}_{alignment} + \omega \mathcal{L}^{w}_{distortion}. \tag{5}$$
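For the alignment, we encourage the overlapping regions to be consistent
at the pixel level. Denoting by $\varphi(\cdot, \cdot)$ the warping
operation and by $\mathbb{1}$ an all-one matrix with the same resolution as
$I_r$, the alignment loss is defined as follows:

$$\begin{aligned}
\mathcal{L}^{w}_{alignment} ={}& \lambda \left\| I_r \cdot \varphi(\mathbb{1}, \mathcal{H}) - \varphi(I_t, \mathcal{H}) \right\|_1 \\
&+ \lambda \left\| I_t \cdot \varphi(\mathbb{1}, \mathcal{H}^{-1}) - \varphi(I_r, \mathcal{H}^{-1}) \right\|_1 \\
&+ \left\| I_r \cdot \varphi(\mathbb{1}, \mathcal{TPS}) - \varphi(I_t, \mathcal{TPS}) \right\|_1,
\end{aligned} \tag{6}$$

where $\mathcal{H}$ and $\mathcal{TPS}$ are warp parameters, and $\lambda$
is a hyperparameter that balances the impact of the different
transformations.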
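The structure of Eq. 6 can be sketched in a few lines. This is a toy illustration, not the training code: the three warps are passed in as plain callables, `shift_x` is a hypothetical zero-filled translation standing in for the warping operation, and the L1 norm is taken as a per-pixel mean, which is an assumption about normalisation. The point it shows is that warping an all-one image yields a validity mask, so only the overlapping pixels contribute.

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference (assumed normalisation of the L1 norm)."""
    return float(np.mean(np.abs(a - b)))

def alignment_loss(I_r, I_t, warp_H, warp_H_inv, warp_TPS, lam):
    """Eq. 6 with each warp given as a callable image -> warped image."""
    ones = np.ones_like(I_r)
    term_h = l1(I_r * warp_H(ones), warp_H(I_t))
    term_h_inv = l1(I_t * warp_H_inv(ones), warp_H_inv(I_r))
    term_tps = l1(I_r * warp_TPS(ones), warp_TPS(I_t))
    return lam * term_h + lam * term_h_inv + term_tps

def shift_x(img, k):
    """Toy warp: translate columns by k pixels with zero fill."""
    out = np.zeros_like(img)
    if k > 0:
        out[:, k:] = img[:, :-k]
    elif k < 0:
        out[:, :k] = img[:, -k:]
    else:
        out[:] = img
    return out

# Usage: a target that is the reference shifted by two columns is perfectly
# aligned by the opposite shift, so every term vanishes.
I_r = np.arange(24, dtype=float).reshape(4, 6)
I_t = shift_x(I_r, -2)
loss = alignment_loss(
    I_r, I_t,
    warp_H=lambda im: shift_x(im, 2),
    warp_H_inv=lambda im: shift_x(im, -2),
    warp_TPS=lambda im: shift_x(im, 2),
    lam=2.0,
)
```

Multiplying by the warped all-one image matters: without it, the zero-filled border of the warped image would be compared against real content and dominate the loss.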
For the distortion, we link adjacent control points in the warped target
image to form a mesh and introduce an inter-grid constraint
$\ell_{inter}$ and an intra-grid constraint $\ell_{intra}$. The former
preserves geometric structures in non-overlapping regions, while the
latter reduces projective distortions. Initially, we approximated a
similarity transformation by DLT for every grid in non-overlapping regions
and took the 4-pt projective error as the loss. However, this constraint,
although commonly used in traditional methods [16, 37], does not work in
deep learning schemes. Instead, we re-explore the constraints from a more
intuitive perspective: the grid edge.
Similar to [42], we penalize any grid edge $\vec{e}$ whose magnitude
exceeds a threshold. Denoting by $\{\vec{e}_{hor}\}$ and
$\{\vec{e}_{ver}\}$ the collections of horizontal and vertical edges, we
describe the intra-grid constraint as follows:

$$\ell_{intra} = \frac{1}{(U+1) \times V} \sum_{\{\vec{e}_{hor}\}} \sigma\!\left( \langle \vec{e}, \vec{i} \rangle - \frac{2W}{V} \right) + \frac{1}{U \times (V+1)} \sum_{\{\vec{e}_{ver}\}} \sigma\!\left( \langle \vec{e}, \vec{j} \rangle - \frac{2H}{U} \right), \tag{7}$$

where $\vec{i}$ / $\vec{j}$ is the horizontal/vertical unit vector, and
$\sigma(\cdot)$ is the ReLU function. Projective distortions are reduced
by preventing the grid shape from scaling dramatically.
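A minimal NumPy sketch of the intra-grid constraint in Eq. 7 follows. It assumes the mesh is stored as a `(U+1, V+1, 2)` array of `(x, y)` control points and that $W$ and $H$ in the thresholds are the image width and height; the function name and array layout are illustrative choices, not taken from the paper.

```python
import numpy as np

def intra_grid_loss(mesh, width, height):
    """Penalise grid edges that stretch beyond twice their nominal length.

    mesh: (U+1, V+1, 2) array of warped control points as (x, y).
    """
    U, V = mesh.shape[0] - 1, mesh.shape[1] - 1
    e_hor = mesh[:, 1:] - mesh[:, :-1]    # (U+1, V, 2) horizontal edges
    e_ver = mesh[1:, :] - mesh[:-1, :]    # (U, V+1, 2) vertical edges
    # <e, i> is the x-component of an edge and <e, j> its y-component;
    # np.maximum(., 0) plays the role of the ReLU.
    hor = np.maximum(e_hor[..., 0] - 2.0 * width / V, 0.0)
    ver = np.maximum(e_ver[..., 1] - 2.0 * height / U, 0.0)
    # .mean() divides by (U+1)*V and U*(V+1) respectively, as in Eq. 7.
    return float(hor.mean() + ver.mean())
```

Because of the ReLU, an undistorted mesh (edge lengths $W/V$ and $H/U$) and any mesh stretched by less than a factor of two incur no penalty; only dramatic scaling is discouraged.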
By encouraging the edge pairs (successive edges in the horizontal or
vertical direction, denoted as $\vec{e}_{s1}$, $\vec{e}_{s2}$) to be
co-linear, we formulate the inter-grid constraint as:

$$\ell_{inter} = \frac{1}{Q} \sum_{\{\vec{e}_{s1}, \vec{e}_{s2}\}} S_{s1,s2} \cdot \left( 1 - \frac{\langle \vec{e}_{s1}, \vec{e}_{s2} \rangle}{\|\vec{e}_{s1}\| \cdot \|\vec{e}_{s2}\|} \right), \tag{8}$$

where $Q$ is the number of edge pairs and $S_{s1,s2}$ is a 0-1 label that
is set to 1 if the edge pair lies in a non-overlapping region. We preserve
structures only in non-overlapping regions, which prevents adverse effects
on the alignment.
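The inter-grid constraint in Eq. 8 can be sketched in the same style. The `(U+1, V+1, 2)` mesh layout, the separate label arrays `S_hor` and `S_ver` for horizontal and vertical pairs, and the small `eps` added for numerical safety are assumptions of this example rather than details given in the text.

```python
import numpy as np

def inter_grid_loss(mesh, S_hor, S_ver, eps=1e-8):
    """Encourage successive edges to be co-linear where the label is 1.

    mesh:  (U+1, V+1, 2) array of warped control points as (x, y).
    S_hor: (U+1, V-1) 0-1 labels for successive horizontal edge pairs.
    S_ver: (U-1, V+1) 0-1 labels for successive vertical edge pairs.
    """
    def one_minus_cos(a, b):
        dot = np.sum(a * b, axis=-1)
        norm = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
        return 1.0 - dot / (norm + eps)   # eps guards against zero-length edges

    e_hor = mesh[:, 1:] - mesh[:, :-1]    # (U+1, V, 2)
    e_ver = mesh[1:, :] - mesh[:-1, :]    # (U, V+1, 2)
    hor = S_hor * one_minus_cos(e_hor[:, :-1], e_hor[:, 1:])
    ver = S_ver * one_minus_cos(e_ver[:-1, :], e_ver[1:, :])
    Q = hor.size + ver.size               # total number of edge pairs
    return float((hor.sum() + ver.sum()) / Q)
```

Each pair contributes one minus the cosine of the angle between its two edges, so a straight line of control points costs nothing and a right-angle bend costs 1; pairs labelled 0 (overlapping regions) are ignored and stay free to move for alignment.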