Accurate Image Super-Resolution Using Very Deep Convolutional Networks
Jiwon Kim, Jung Kwon Lee and Kyoung Mu Lee
Department of ECE, ASRI, Seoul National University, Korea
{j.kim, deruci, kyoungmu}@snu.ac.kr
Abstract
We present a highly accurate single-image super-resolution (SR) method. Our method uses a very deep convolutional network inspired by VGG-net used for ImageNet classification [19]. We find increasing our network depth shows a significant improvement in accuracy. Our final model uses 20 weight layers. By cascading small filters many times in a deep network structure, contextual information over large image regions is exploited in an efficient way. With very deep networks, however, convergence speed becomes a critical issue during training. We propose a simple yet effective training procedure. We learn residuals only and use extremely high learning rates (10^4 times higher than SRCNN [6]) enabled by adjustable gradient clipping. Our proposed method performs better than existing methods in accuracy, and the visual improvements in our results are easily noticeable.
1. Introduction
We address the problem of generating a high-resolution (HR) image given a low-resolution (LR) image, commonly referred to as single image super-resolution (SISR) [12, 8, 9]. SISR is widely used in computer vision applications ranging from security and surveillance imaging to medical imaging, where more image details are required on demand.
Many SISR methods have been studied in the computer vision community. Early methods include interpolation such as bicubic interpolation and Lanczos resampling [7], and more powerful methods utilize statistical image priors [20, 13] or internal patch recurrence [9].
Currently, learning methods are widely used to model a mapping from LR to HR patches. Neighbor embedding [4, 15] methods interpolate the patch subspace. Sparse coding [25, 26, 21, 22] methods use a learned compact dictionary based on sparse signal representation. Lately, random forest [18] and convolutional neural network (CNN) [6] methods have also been used with large improvements in accuracy.
Among them, Dong et al. [6] demonstrated that a CNN can be used to learn a mapping from LR to HR in an end-to-end manner. Their method, termed SRCNN, does not require any engineered features that are typically necessary in other methods [25, 26, 21, 22] and shows state-of-the-art performance.
Figure 1: Our VDSR improves PSNR for scale factor ×2 on dataset Set5 in comparison to the state-of-the-art methods (SRCNN uses the public slower implementation using CPU). VDSR outperforms SRCNN by a large margin (0.87 dB). [Plot: PSNR (dB) versus running time (s) for VDSR (Ours), SRCNN, SelfEx, RFL, and A+.]
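For context on the metric in Figure 1, PSNR between a reference and an estimate is the standard 10·log10(peak²/MSE) in dB; the following is a minimal sketch (not the paper's evaluation code), assuming images scaled to [0, 1]:

```python
import numpy as np

def psnr(reference: np.ndarray, estimate: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB (higher is better)."""
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# A uniform error of 0.01 on [0, 1] images gives 40 dB.
a = np.zeros((4, 4))
b = np.full((4, 4), 0.01)
print(round(psnr(a, b), 2))  # -> 40.0
```

Because the scale is logarithmic, a 0.87 dB gap such as the one in Figure 1 corresponds to a sizeable reduction in mean squared error.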
While SRCNN successfully introduced a deep learning technique into the super-resolution (SR) problem, we find it has limitations in three aspects: first, it relies on the context of small image regions; second, training converges too slowly; third, the network only works for a single scale.
In this work, we propose a new method to practically resolve these issues.
Context We utilize contextual information spread over very large image regions. For a large scale factor, it is often the case that the information contained in a small patch is not sufficient for detail recovery (ill-posed). Our very deep network with a large receptive field takes a large image context into account.
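As a back-of-the-envelope check on how stacking small filters grows context: the receptive field of D stacked stride-1 convolutions with kernel size k is D(k-1)+1 pixels per side. The sketch below (the helper name is ours) confirms that 20 layers of 3×3 filters cover a 41×41 region.

```python
def receptive_field(depth: int, kernel: int = 3) -> int:
    """One-sided receptive field (in pixels) of `depth` stacked
    stride-1 convolutions, each with odd kernel size `kernel`."""
    # Every additional layer widens the context by (kernel - 1) pixels.
    return depth * (kernel - 1) + 1

# 20 layers of 3x3 filters: each output pixel sees a 41x41 input region.
print(receptive_field(20))  # -> 41
```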
Convergence We suggest a way to speed up the training: residual-learning CNN and extremely high learning rates. As the LR and HR images share the same information to a large extent, explicitly modelling the residual image, which is the difference between the HR and LR images, is advantageous. We propose a network structure for effi-
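The residual-learning setup described above can be sketched as follows. This is a toy NumPy illustration, not the paper's actual network: the helper names are ours, and the clipping rule (bounding each gradient by theta divided by the learning rate so the effective update stays stable) is one reading of the "adjustable gradient clipping" mentioned in the abstract.

```python
import numpy as np

def residual_target(hr: np.ndarray, lr_up: np.ndarray) -> np.ndarray:
    """Training target: the residual r = HR - (upscaled LR),
    rather than the full HR image."""
    return hr - lr_up

def reconstruct(lr_up: np.ndarray, residual: np.ndarray) -> np.ndarray:
    # Final SR output: add the predicted residual back to the input.
    return lr_up + residual

def clip_adjustable(grad: np.ndarray, theta: float, lr: float) -> np.ndarray:
    """Clip gradients to [-theta/lr, theta/lr]; as the learning rate
    grows, the bound tightens, keeping the step magnitude in check."""
    bound = theta / lr
    return np.clip(grad, -bound, bound)

# With a perfect residual prediction, reconstruction recovers HR exactly.
rng = np.random.default_rng(0)
hr = rng.random((8, 8))
lr_up = rng.random((8, 8))
assert np.allclose(reconstruct(lr_up, residual_target(hr, lr_up)), hr)
```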