Image Transformer
Niki Parmar *1, Ashish Vaswani *1, Jakob Uszkoreit 1, Łukasz Kaiser 1, Noam Shazeer 1, Alexander Ku 2 3, Dustin Tran 4
Abstract
Image generation has been successfully cast as an autoregressive sequence generation or transformation problem. Recent work has shown that self-attention is an effective way of modeling textual sequences. In this work, we generalize a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood. By restricting the self-attention mechanism to attend to local neighborhoods we significantly increase the size of images the model can process in practice, despite maintaining significantly larger receptive fields per layer than typical convolutional neural networks. While conceptually simple, our generative models significantly outperform the current state of the art in image generation on ImageNet, improving the best published negative log-likelihood on ImageNet from 3.83 to 3.77. We also present results on image super-resolution with a large magnification ratio, applying an encoder-decoder configuration of our architecture. In a human evaluation study, we find that images generated by our super-resolution model fool human observers three times more often than the previous state of the art.
1. Introduction
Recent advances in modeling the distribution of natural images with neural networks allow them to generate increasingly natural-looking images. Some models, such as the PixelRNN and PixelCNN (van den Oord et al., 2016a), have a tractable likelihood. Beyond licensing the comparatively simple and stable training regime of directly maximizing log-likelihood, this enables the straightforward application of these models in problems such as image compression (van den Oord & Schrauwen, 2014) and probabilistic planning and exploration (Bellemare et al., 2016).

* Equal contribution. Ordered by coin flip. 1 Google Brain, Mountain View, USA. 2 Department of Electrical Engineering and Computer Sciences, University of California, Berkeley. 3 Work done during an internship at Google Brain. 4 Google AI, Mountain View, USA. Correspondence to: Ashish Vaswani, Niki Parmar, Jakob Uszkoreit <avaswani@google.com, nikip@google.com, usz@google.com>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Table 1. Three outputs of a CelebA super-resolution model followed by three image completions by a conditional CIFAR-10 model, with input, model output and the original from left to right.
The likelihood is made tractable by modeling the joint distribution of the pixels in the image as the product of conditional distributions (Larochelle & Murray, 2011; Theis & Bethge, 2015). This turns the problem into a sequence modeling problem, and the state-of-the-art approaches apply recurrent or convolutional neural networks to predict each next pixel given all previously generated pixels (van den Oord et al., 2016a). Training recurrent neural networks to sequentially predict each pixel of even a small image is computationally very challenging. Thus, parallelizable models that use convolutional neural networks such as the PixelCNN have recently received much more attention, and have now surpassed the PixelRNN in quality (van den Oord et al., 2016b).
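As a toy illustration (not the paper's model), the chain-rule factorization p(x) = ∏ᵢ p(xᵢ | x₁, …, xᵢ₋₁) makes the log-likelihood a simple sum of per-pixel conditional log-probabilities, which is what makes it tractable to evaluate and maximize directly. The conditional below is a made-up stand-in for the learned per-pixel distribution:

```python
import math

# Toy autoregressive model over a flattened "image" whose pixels take one
# of 4 intensity values. The conditional p(x_i | x_1, ..., x_{i-1}) here is
# a hypothetical stand-in: uniform over intensities >= the previous pixel.

def conditional_prob(x_i, prefix, num_values=4):
    """p(x_i | prefix): uniform over the allowed intensities."""
    lo = prefix[-1] if prefix else 0
    allowed = num_values - lo
    return 1.0 / allowed if x_i >= lo else 0.0

def log_likelihood(pixels, num_values=4):
    """log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1})."""
    total = 0.0
    for i, x_i in enumerate(pixels):
        total += math.log(conditional_prob(x_i, pixels[:i], num_values))
    return total

print(log_likelihood([0, 1, 3]))  # log(1/4) + log(1/4) + log(1/3)
```

In the models discussed here, each conditional is instead parameterized by a neural network, and training maximizes exactly this sum over the training images.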
One disadvantage of CNNs compared to RNNs is their
typically fairly limited receptive field. This can adversely
affect their ability to model long-range phenomena common
in images, such as symmetry and occlusion, especially with
a small number of layers. Growing the receptive field has
been shown to improve quality significantly (Salimans et al.).
Doing so, however, comes at a significant cost in number