came up with robust video super-resolution with learned temporal dynamics (RVSR-LTD) [16], which creates a temporal adaptive neural network to adaptively determine the optimal scale of temporal dependencies. However, RVSR-LTD borrows the structure of ESPCN [28] (a simple three-layer convolution) to combine with the temporal adaptive neural network, which limits its performance.
In summary, BRCN, VESPCN, VSRnet and RVSR-LTD only use a simple and direct form of connection, which results in the shallow depth and simple structure of their underlying networks. In addition, VESPCN, VSRnet and DRVSR take pre-amplified images as inputs, so the magnified large images incur huge GPU memory and computational costs. It is therefore necessary to avoid these weaknesses and make improvements to obtain more realistic image details.
Fig. 1. Different skip connection schemes. (a) No skip connection. (b)
Distinct-source skip connection. (c) Shared-source skip connection. (d) Dense
skip connection.
C. Inter-layer Connection
How to design an effective network structure and improve the stability of the model has always been a significant part of neural network research. Recently, with the help of skip connections [34], [35], deep neural networks have regained popularity. As shown in Figure 1, different skip connection schemes have been proposed to build deep neural networks. ResNet [35] uses bypassing paths between layers to effectively train networks with more than 100 layers. Huang et al. [36] randomly dropped layers to improve the training of deep residual networks, demonstrating that a great amount of redundancy exists in deep residual networks. DenseNet [37] links all layers in the network and attempts to fully exploit the advantages of skip connections.
Further, these ideas have been tailored to SISR and video SR. For instance, instead of learning to reconstruct the HR image directly, VDSR [24] adopts one skip connection to learn the residual image and adds it to the bicubically amplified LR image to obtain the SR image. Tong et al. [38] and Zhang et al. [30] both adopted dense skip connections in their networks for SISR and achieved promising results.
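To make these schemes concrete, the following is a minimal PyTorch sketch of two of the connection patterns in Figure 1: a residual block with a distinct-source skip (in the style of ResNet [35]) and a dense block (in the style of DenseNet [37]). The channel sizes and depths are illustrative assumptions, not the configurations used in the cited papers.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Distinct-source skip (Fig. 1(b)): each block adds its own input back."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Identity bypass around two convolutions, as in ResNet.
        return x + self.conv2(self.relu(self.conv1(x)))

class DenseBlock(nn.Module):
    """Dense skip (Fig. 1(d)): each layer sees all previous feature maps."""
    def __init__(self, in_channels=64, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(in_channels + i * growth, growth, 3, padding=1)
            for i in range(num_layers)
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Concatenate every earlier output before each convolution.
            out = self.relu(layer(torch.cat(features, dim=1)))
            features.append(out)
        return torch.cat(features, dim=1)
```

A shared-source scheme (Fig. 1(c)) would instead add the block's original input to the output of every sub-block.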
III. OUR METHOD
In this section, we present the design methodology of the proposed MMCNN, including the overall architecture and the details of its individual modules.
A. Architecture
As shown in Figure 2, our model consists of two parts: an optical flow network and an image-reconstruction network. A video SR model aims to estimate one HR frame from a series of adjacent LR frames; we therefore first use the optical flow network to estimate the motion between each input frame and the reference frame. Then, we use the optical flow for motion compensation, transforming the input LR frames into warped frames. After that, we send these warped frames to the image-reconstruction network, which is further composed of four modules: feature extraction, multi-memory detail fusion, feature reconstruction and sub-pixel magnification. We elaborate on each of these modules in the following, and a schematic sketch of the overall data flow is given below.
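As a rough sketch (not the exact implementation), the pipeline can be summarized as follows, where flow_net, warp, and sr_net are placeholders for the optical flow network, the motion-compensation warp of Section III-B, and the four-module image-reconstruction network, respectively.

```python
def super_resolve(lr_frames, ref_idx, flow_net, warp, sr_net):
    """Schematic forward pass: estimate motion for each neighboring frame,
    warp it toward the reference, then reconstruct the HR frame."""
    ref = lr_frames[ref_idx]  # the frame being super-resolved
    warped = [
        frame if i == ref_idx else warp(frame, flow_net(frame, ref))
        for i, frame in enumerate(lr_frames)
    ]
    # sr_net bundles the four modules named above: feature extraction,
    # multi-memory detail fusion, feature reconstruction, and
    # sub-pixel magnification.
    return sr_net(warped)
```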
B. Motion Estimation and Compensation
Motion estimation and compensation are widely studied
for video processing. Jaderberg et al. proposed the spatial transformer network [39], a differentiable module that can be used to spatially transform feature maps. In video
SR, motion estimation and compensation are mainly adopted
to represent the temporal correlations among consecutive LR
frames. Joint motion compensation for SR with neural net-
works has also been studied through recurrent bidirectional
networks [18], [21], [22]. A motion compensation scheme
based on spatial transformers has been designed [21], which
is combined with spatio-temporal models to enable a very
efficient solution for video SR. In general, a motion estimation
module takes two frames as inputs and produces an optical
flow vector field as follows:
$F_{i \to j} = (u_{i \to j}, v_{i \to j}) = ME(I_i, I_j; \theta_{ME})$,   (1)
where $F_{i \to j}$ denotes the optical flow field generated from input frame $I_i$ to $I_j$, $ME(\cdot)$ is the operator for calculating optical flow, and $\theta_{ME}$ is the parameter of the operator $ME(\cdot)$.
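For illustration only, a toy CNN instance of the $ME(\cdot)$ interface in Eq. (1) might look as follows; this is an assumed architecture for exposition, not the MCT operator that we actually adopt below.

```python
import torch
import torch.nn as nn

class ToyFlowNet(nn.Module):
    """A toy instance of ME(.) in Eq. (1): two frames in, a 2-channel
    (u, v) flow field out. Illustrative only."""
    def __init__(self, in_channels=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(2 * in_channels, 24, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(24, 24, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(24, 2, 3, padding=1), nn.Tanh(),  # bounded (u, v) offsets
        )

    def forward(self, frame_i, frame_j):
        # F_{i->j} = ME(I_i, I_j; theta_ME)
        return self.body(torch.cat([frame_i, frame_j], dim=1))
```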
We have tested FlowNet [40] and its improved versions, FlowNet-SD [41] and FlowNet2 [42]. However, these networks have a large number of parameters and a heavy computational cost. Thus, we choose the motion compensation transformer (MCT) operator [21], as [22] does, which is easier to train and makes it possible to train both the optical flow network and the image-reconstruction network simultaneously.
Motion compensation is then used for spatial alignment, a process that can be described as:
$J = MC(I, F; \theta_{MC})$,   (2)
where $J$ denotes the warped image, $MC(\cdot)$ is the operator for motion compensation, $I$ represents the input image, $F$ stands for the optical flow field, and $\theta_{MC}$ is the parameter of the operator $MC(\cdot)$.
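A common way to realize the $MC(\cdot)$ operator of Eq. (2) is bilinear warping in the spirit of spatial transformers [39]. The sketch below uses PyTorch's grid_sample and assumes the flow is given in pixel units; it illustrates the operation rather than reproducing any particular implementation.

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Bilinear warping as an instance of MC(.) in Eq. (2).
    image: (N, C, H, W); flow: (N, 2, H, W) in pixels, channel 0 = u (x),
    channel 1 = v (y). Sketch only, not a specific published implementation."""
    _, _, h, w = image.shape
    # Base grid of pixel coordinates (x, y).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(image.device)  # (2, H, W)
    coords = base.unsqueeze(0) + flow                             # shifted coords
    # Normalize to [-1, 1], the range expected by grid_sample.
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=3)                           # (N, H, W, 2)
    return F.grid_sample(image, grid, mode="bilinear", align_corners=True)
```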
Based on the spatial transformer network [39], Caballero et al. [21] proposed a multi-scale spatial transformer motion compensation method, which extracts the optical flow in a coarse-to-fine manner. Further, Tao et al. [22] proposed an SPMC layer, which projects the compensated frame from the LR space to the HR space. We have tested these two methods and decided to adopt the MC from VESPCN [21], which shows a little