kernel. In fact, we do not consider gradient direction in-
formation since gradient intensity is adequate to reveal the
sharpness of local regions in recovered images. Hence we
adopt the intensity maps as the gradient maps. Such gradi-
ent maps can be regarded as another kind of images, so that
techniques for image-to-image translation can be utilized
to learn the mapping between two modalities. The transla-
tion process is equivalent to the spatial distribution transla-
tion from LR edge sharpness to HR edge sharpness. Since
most areas of the gradient map are close to zero, the convolutional
neural network can concentrate more on the spatial
relationship of outlines. Therefore, it may be easier for the
network to capture structure dependency and consequently
produce approximate gradient maps for SR images.
As shown in Figure 2, the gradient branch incorpo-
rates several intermediate-level representations from the SR
branch. The motivation of such a scheme is that the well-
designed SR branch is capable of carrying rich structural in-
formation which is pivotal to the recovery of gradient maps.
Hence we utilize the features as a strong prior to promote
the performance of the gradient branch, whose parameters
can be largely reduced in this case. Between every two
intermediate features, there is a gradient block, which can be
any basic block for extracting higher-level features. Once we get
the SR gradient maps by the gradient branch, we are able to
integrate the obtained gradient features into the SR branch
to guide SR reconstruction in turn. The magnitude of the
gradient map can implicitly reflect whether a recovered region
should be sharp or smooth. In practice, we feed the feature
maps produced by the next-to-last layer of the gradient branch
to the SR branch. Meanwhile, we generate the output gra-
dient maps by a 1 × 1 convolution layer with these feature
maps as inputs.
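As a small illustration (with hypothetical channel counts), a 1 × 1 convolution is just a per-pixel linear projection over channels, which is how the next-to-last features can be collapsed into a single-channel gradient map:

```python
import numpy as np

# Hypothetical sizes: 64 feature channels at 8x8 spatial resolution.
C, H, W = 64, 8, 8
rng = np.random.default_rng(0)
feat = rng.standard_normal((C, H, W))   # next-to-last layer features
w = rng.standard_normal((1, C))         # 1x1 conv weights (one output channel)

# A 1x1 convolution mixes channels independently at every pixel:
# contract the channel axis of the features with the weight matrix.
grad_map = np.tensordot(w, feat, axes=([1], [0]))
print(grad_map.shape)  # (1, 8, 8)
```

The same per-pixel projection with more output channels would produce the feature maps passed back to the SR branch.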
3.2.2 Structure-Preserving SR Branch
We design a structure-preserving SR branch to get the final
SR outputs. This branch consists of two parts. The first
part is a regular SR network comprising multiple generative
neural blocks, which can be of any architecture. Here we
introduce the Residual in Residual Dense Block (RRDB)
proposed in ESRGAN [42]. There are 23 RRDB blocks in
the original model. Therefore, we feed the feature
maps from the 5th, 10th, 15th, and 20th blocks into the gradient
branch. Since regular SR models produce images with
only 3 channels, we remove the last convolutional reconstruction
layer and feed the output features to the subsequent
part. The second part of the SR branch incorporates the SR
gradient feature maps obtained from the gradient branch as
mentioned above. We merge the structural information with a
fusion block that fuses the features from the two branches.
Specifically, we concatenate the two features and
then use another RRDB block and convolutional layer to
reconstruct the final SR features. It is noteworthy that we
only add one RRDB block into the SR branch. Thus the pa-
rameter increment is slight compared to the original model
with 23 blocks.
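The fusion step can be sketched as channel-wise concatenation; the channel counts and spatial size below are assumptions for illustration, and the subsequent RRDB block and convolution are only indicated, not implemented:

```python
import numpy as np

rng = np.random.default_rng(0)
f_sr = rng.standard_normal((64, 32, 32))    # SR-branch features
f_grad = rng.standard_normal((64, 32, 32))  # gradient-branch features

# Fusion block, step 1: concatenate along the channel axis.
fused = np.concatenate([f_sr, f_grad], axis=0)

# An RRDB block plus a convolutional layer would then map the 128
# fused channels back to the final SR features (omitted here).
print(fused.shape)  # (128, 32, 32)
```

Because only one extra RRDB block operates on the fused tensor, the added parameter cost stays small relative to the 23-block backbone.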
3.3. Objective Functions
Conventional Loss: Most SR methods optimize their
elaborately designed networks with a common pixelwise loss,
which is effective for super-resolution as measured by PSNR.
This loss reduces the average pixel difference
between recovered images and ground truths, but the
results may be too smooth to retain sharp edges for visual
quality. Nevertheless, it is still widely used to accelerate
convergence and improve SR performance:
$\mathcal{L}^{SR}_{Pix_I} = \mathbb{E}_{I^{SR}} \left\| G(I^{LR}) - I^{HR} \right\|_1 . \quad (3)$
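For a single image pair, Eq. (3) reduces to a mean absolute error; this sketch uses the mean over pixels as a stand-in for the expectation and norm:

```python
import numpy as np

def l1_pixel_loss(sr, hr):
    """Pixelwise L1 loss of Eq. (3) for one image pair (mean absolute error)."""
    return np.abs(sr - hr).mean()

sr = np.zeros((3, 4, 4))  # toy G(I_LR) "prediction"
hr = np.ones((3, 4, 4))   # toy I_HR ground truth
print(l1_pixel_loss(sr, hr))  # 1.0
```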
Perceptual loss has been proposed in [20] to improve per-
ceptual quality of recovered images. Features containing se-
mantic information are extracted by a pre-trained VGG net-
work [36]. The Euclidean distances between the features of
HR images and SR ones are minimized in perceptual loss:
$\mathcal{L}^{SR}_{Per} = \mathbb{E}_{I^{SR}} \left\| \phi_i(G(I^{LR})) - \phi_i(I^{HR}) \right\|_1 , \quad (4)$

where $\phi_i(\cdot)$ denotes the $i$-th layer output of the VGG model.
Methods [27, 42] based on generative adversarial net-
works (GANs) [3, 4, 15, 16, 21, 33] also play an important
role in the SR problem. The discriminator $D_I$ and the
generator $G$ are optimized by a two-player game as follows:
$\mathcal{L}^{SR}_{Dis_I} = -\mathbb{E}_{I^{SR}}\left[\log\left(1 - D_I(I^{SR})\right)\right] - \mathbb{E}_{I^{HR}}\left[\log D_I(I^{HR})\right], \quad (5)$

$\mathcal{L}^{SR}_{Adv_I} = -\mathbb{E}_{I^{SR}}\left[\log D_I(G(I^{LR}))\right]. \quad (6)$
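For a single sample with discriminator scores in (0, 1), Eqs. (5) and (6) reduce to the expressions below; the expectations are dropped in this one-sample sketch:

```python
import numpy as np

def discriminator_loss(d_sr, d_hr):
    """Single-sample Eq. (5): d_sr = D_I(I_SR), d_hr = D_I(I_HR)."""
    return -np.log(1.0 - d_sr) - np.log(d_hr)

def adversarial_loss(d_sr):
    """Single-sample Eq. (6): d_sr = D_I(G(I_LR))."""
    return -np.log(d_sr)

# At the 0.5/0.5 equilibrium the discriminator loss equals 2*log(2);
# a generator that fully fools D_I (score 1.0) pays zero adversarial loss.
print(discriminator_loss(0.5, 0.5))
print(adversarial_loss(1.0))  # 0.0
```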
Following [21, 42], we employ the relativistic average GAN
(RaGAN) to achieve better optimization in practice. Models
supervised by the above objective functions merely consider
the image-space constraint on images but neglect the
semantically structural information provided by the gradient
space. While the generated results look photo-realistic,
they also contain a number of undesired geometric distortions.
Thus we introduce a gradient loss to alleviate this issue.
Gradient Loss: Our motivation can be illustrated clearly
by Figure 3. Here we only consider a simple one-dimensional
case. If the model is optimized only in image space by the
L1 loss, we usually obtain an SR sequence like Figure 3 (b) given
an input test sequence whose ground truth is a sharp
edge, as in Figure 3 (a). The model fails to recover sharp edges
because it tends to produce a statistical average
of the possible HR solutions in the training data. In this
case, if we compute and show the gradient magnitudes of the
two sequences, it can be observed that the SR gradient is
flat with low values while the HR gradient is a spike with