function [47], random forest [37], and anchored neighborhood regression [41], [42] have been proposed to further improve the mapping accuracy and speed. The sparse-coding-based method and its several improvements [41], [42], [48] are among the state-of-the-art SR methods today. In these methods, the patches are the focus of the optimization; the patch extraction and aggregation steps are treated as pre-/post-processing and handled separately.
The majority of SR algorithms [2], [4], [15], [41], [48],
[49], [50], [51] focus on gray-scale or single-channel
image super-resolution. For color images, the aforementioned methods first transform the problem into a different color space (YCbCr or YUV), and SR is applied only to the luminance channel. There are also works attempting to super-resolve all channels simultaneously. For example, Kim and Kwon [25] and Dai et al. [7] apply their models to each RGB channel and combine the outputs to produce the final results. However, none of these works analyzes the SR performance on different channels or the necessity of recovering all three channels.
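To make the luminance-only pipeline concrete, the sketch below converts an RGB image to YCbCr, applies SR to the Y channel only, and upscales the chroma channels with plain bicubic interpolation. It is a minimal illustration in Python with Pillow; super_resolve_y is a hypothetical placeholder for any single-channel SR method, not a routine from the cited works.

    # Sketch of the common luminance-only SR pipeline (Pillow for color handling);
    # "super_resolve_y" is a hypothetical stand-in for any single-channel SR method.
    from PIL import Image

    def upscale_luminance_only(rgb_image, scale, super_resolve_y):
        # Work in YCbCr and super-resolve only the luminance (Y) channel.
        y, cb, cr = rgb_image.convert("YCbCr").split()
        target = (rgb_image.width * scale, rgb_image.height * scale)
        y_sr = super_resolve_y(y, target)             # learned SR on the luminance channel
        cb_up = cb.resize(target, Image.BICUBIC)      # chroma: plain bicubic upscaling
        cr_up = cr.resize(target, Image.BICUBIC)
        return Image.merge("YCbCr", (y_sr, cb_up, cr_up)).convert("RGB")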
2.2 Convolutional Neural Networks
Convolutional neural networks (CNN) date back
decades [27], and deep CNNs have recently gained explosive popularity, partly due to their success in image classification [18], [26]. They have also been success-
fully applied to other computer vision fields, such as
object detection [34], [40], [52], face recognition [39], and
pedestrian detection [35]. Several factors are of central
importance in this progress: (i) the efficient training
implementation on modern powerful GPUs [26], (ii) the
proposal of the Rectified Linear Unit (ReLU) [33], which makes convergence much faster while still yielding good quality [26], and (iii) the easy access to an abundance of data (like ImageNet [9]) for training larger models. Our method also benefits from these advances.
2.3 Deep Learning for Image Restoration
There have been a few studies of using deep learning
techniques for image restoration. The multi-layer perceptron (MLP), whose layers are all fully-connected (in contrast to convolutional), has been applied to natural image denoising [3] and post-deblurring denoising [36]. More closely related to our work, convolutional neural networks have been applied to natural image denoising [22] and to removing noisy patterns (dirt/rain) [12]. These restoration
problems are more or less denoising-driven. Cui et al. [5]
propose to embed auto-encoder networks in their super-resolution pipeline under the notion of the internal example-based approach [16]. The deep model is not specifically
designed to be an end-to-end solution, since each layer
of the cascade requires independent optimization of the
self-similarity search process and the auto-encoder. In contrast, the proposed SRCNN optimizes an end-to-end mapping. Further, the SRCNN is faster. It is not only a quantitatively superior method, but also a practically useful one.
3 CONVOLUTIONAL NEURAL NETWORKS FOR
SUPER-RESOLUTION
3.1 Formulation
Given a single low-resolution image, we first upscale it to the desired size using bicubic interpolation, which is the only pre-processing we perform^3. Let us denote
the interpolated image as Y. Our goal is to recover
from Y an image F(Y) that is as similar as possible
to the ground truth high-resolution image X. For ease of presentation, we still call Y a “low-resolution” image, although it has the same size as X. We wish to learn a mapping F, which conceptually consists of three
operations:
1) Patch extraction and representation: this opera-
tion extracts (overlapping) patches from the low-
resolution image Y and represents each patch as a
high-dimensional vector. These vectors comprise a
set of feature maps, whose number equals the dimensionality of the vectors.
2) Non-linear mapping: this operation nonlinearly
maps each high-dimensional vector onto another
high-dimensional vector. Each mapped vector is
conceptually the representation of a high-resolution
patch. These vectors comprise another set of feature
maps.
3) Reconstruction: this operation aggregates the
above high-resolution patch-wise representations
to generate the final high-resolution image. This
image is expected to be similar to the ground truth
X.
We will show that all these operations form a convolu-
tional neural network. An overview of the network is
depicted in Figure 2. Next we detail our definition of
each operation.
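As a rough illustration of how these three operations can be realized as stacked convolutions, the sketch below uses PyTorch; the class name, the filter sizes (f1 = 9, f2 = 1, f3 = 5), and the layer widths (n1 = 64, n2 = 32) are illustrative assumptions, not settings given in this section.

    # Minimal sketch of the three-operation pipeline as a three-layer CNN (PyTorch).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ThreeLayerSRSketch(nn.Module):
        def __init__(self, c=1, n1=64, n2=32, f1=9, f2=1, f3=5):  # assumed hyper-parameters
            super().__init__()
            # 1) Patch extraction and representation: n1 filters of support c x f1 x f1.
            self.conv1 = nn.Conv2d(c, n1, kernel_size=f1, padding=f1 // 2)
            # 2) Non-linear mapping: each n1-dimensional vector -> an n2-dimensional vector.
            self.conv2 = nn.Conv2d(n1, n2, kernel_size=f2, padding=f2 // 2)
            # 3) Reconstruction: aggregate the patch-wise representations into the HR image.
            self.conv3 = nn.Conv2d(n2, c, kernel_size=f3, padding=f3 // 2)

        def forward(self, y):
            h = F.relu(self.conv1(y))   # feature maps of the interpolated image Y
            h = F.relu(self.conv2(h))   # high-resolution patch representations
            return self.conv3(h)        # F(Y); no non-linearity after the last layer

    # Usage: bicubic-upscale the low-resolution input to the target size first
    # (the only pre-processing), then apply the network.
    lr = torch.rand(1, 1, 32, 32)
    Y = F.interpolate(lr, scale_factor=3, mode="bicubic", align_corners=False)
    X_hat = ThreeLayerSRSketch()(Y)     # same spatial size as Y

Zero-padding is used here only so that the sketch's output matches its input size; this is an implementation choice of the sketch rather than part of the formulation.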
3.1.1 Patch extraction and representation
A popular strategy in image restoration (e.g., [1]) is to
densely extract patches and then represent them by a set
of pre-trained bases such as PCA, DCT, Haar, etc. This
is equivalent to convolving the image by a set of filters,
each of which is a basis. In our formulation, we incorporate the optimization of these bases into the optimization of
the network. Formally, our first layer is expressed as an operation $F_1$:
$$
F_1(\mathbf{Y}) = \max\bigl(0,\; W_1 * \mathbf{Y} + B_1\bigr), \tag{1}
$$
where $W_1$ and $B_1$ represent the filters and biases, respectively, and '$*$' denotes the convolution operation. Here, $W_1$ corresponds to $n_1$ filters of support $c \times f_1 \times f_1$, where $c$ is the number of channels in the input image and $f_1$ is the spatial size of a filter. Intuitively, $W_1$ applies $n_1$ convolutions on the image, and each convolution has a kernel of size $c \times f_1 \times f_1$.
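To make the shapes in Eq. (1) concrete, the brief sketch below writes the first layer in PyTorch's functional form; the values n1 = 64, c = 1, and f1 = 9 are arbitrary assumptions chosen for illustration, not settings taken from this section.

    # Eq. (1) written out as convolution + ReLU; n1=64, c=1, f1=9 are assumed values.
    import torch
    import torch.nn.functional as F

    n1, c, f1 = 64, 1, 9
    Y = torch.rand(1, c, 96, 96)              # the interpolated "low-resolution" image Y
    W1 = torch.randn(n1, c, f1, f1) * 1e-3    # n1 filters of support c x f1 x f1
    B1 = torch.zeros(n1)                      # one bias per filter
    F1_Y = F.relu(F.conv2d(Y, W1, bias=B1, padding=f1 // 2))  # max(0, W1 * Y + B1)
    print(F1_Y.shape)                         # torch.Size([1, 64, 96, 96]): n1 feature maps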
3. Bicubic interpolation is also a convolutional operation, so it can
be formulated as a convolutional layer. However, the output size of
this layer is larger than the input size, so there is a fractional stride. To
take advantage of popular, well-optimized implementations such as cuda-convnet [26], we exclude this “layer” from learning.