Image Transformer
Niki Parmar *1, Ashish Vaswani *1, Jakob Uszkoreit 1, Łukasz Kaiser 1, Noam Shazeer 1, Alexander Ku 2 3, Dustin Tran 4
Abstract
Image generation has been successfully cast as an autoregressive sequence generation or transformation problem. Recent work has shown that self-attention is an effective way of modeling textual sequences. In this work, we generalize a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood. By restricting the self-attention mechanism to attend to local neighborhoods we significantly increase the size of images the model can process in practice, despite maintaining significantly larger receptive fields per layer than typical convolutional neural networks. While conceptually simple, our generative models significantly outperform the current state of the art in image generation on ImageNet, improving the best published negative log-likelihood on ImageNet from 3.83 to 3.77. We also present results on image super-resolution with a large magnification ratio, applying an encoder-decoder configuration of our architecture. In a human evaluation study, we find that images generated by our super-resolution model fool human observers three times more often than the previous state of the art.
1. Introduction
Recent advances in modeling the distribution of natural images with neural networks allow them to generate increasingly natural-looking images. Some models, such as the PixelRNN and PixelCNN (van den Oord et al., 2016a), have a tractable likelihood. Beyond licensing the comparatively simple and stable training regime of directly maximizing log-likelihood, this enables the straightforward application of these models in problems such as image compression (van den Oord & Schrauwen, 2014) and probabilistic planning and exploration (Bellemare et al., 2016).

* Equal contribution. Ordered by coin flip. 1 Google Brain, Mountain View, USA. 2 Department of Electrical Engineering and Computer Sciences, University of California, Berkeley. 3 Work done during an internship at Google Brain. 4 Google AI, Mountain View, USA. Correspondence to: Ashish Vaswani, Niki Parmar, Jakob Uszkoreit <avaswani@google.com, nikip@google.com, usz@google.com>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Table 1. Three outputs of a CelebA super-resolution model followed by three image completions by a conditional CIFAR-10 model, with input, model output and the original from left to right.
The likelihood is made tractable by modeling the joint distribution of the pixels in the image as the product of conditional distributions (Larochelle & Murray, 2011; Theis & Bethge, 2015). This turns the problem into a sequence modeling problem, and the state-of-the-art approaches apply recurrent or convolutional neural networks to predict each next pixel given all previously generated pixels (van den Oord et al., 2016a). Training recurrent neural networks to sequentially predict each pixel of even a small image is computationally very challenging. Thus, parallelizable models that use convolutional neural networks such as the PixelCNN have recently received much more attention, and have now surpassed the PixelRNN in quality (van den Oord et al., 2016b).
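As a toy illustration (not the paper's model), the chain-rule factorization p(x) = ∏ᵢ p(xᵢ | x₁, …, xᵢ₋₁) makes the log-likelihood a simple sum of per-pixel conditional log-probabilities, which is what makes it tractable to evaluate and maximize directly. The conditional below is a made-up stand-in for the learned per-pixel distribution:

```python
import math

# Toy autoregressive model over a flattened "image" whose pixels take one
# of 4 intensity values. The conditional p(x_i | x_1, ..., x_{i-1}) here is
# a hypothetical stand-in: uniform over intensities >= the previous pixel.

def conditional_prob(x_i, prefix, num_values=4):
    """p(x_i | prefix): uniform over the allowed intensities."""
    lo = prefix[-1] if prefix else 0
    allowed = num_values - lo
    return 1.0 / allowed if x_i >= lo else 0.0

def log_likelihood(pixels, num_values=4):
    """log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1})."""
    total = 0.0
    for i, x_i in enumerate(pixels):
        total += math.log(conditional_prob(x_i, pixels[:i], num_values))
    return total

print(log_likelihood([0, 1, 3]))  # log(1/4) + log(1/4) + log(1/3)
```

In the models discussed here, each conditional is instead parameterized by a neural network, and training maximizes exactly this sum over the training images.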
One disadvantage of CNNs compared to RNNs is their
typically fairly limited receptive field. This can adversely
affect their ability to model long-range phenomena common
in images, such as symmetry and occlusion, especially with
a small number of layers. Growing the receptive field has
been shown to improve quality significantly (Salimans et al.).
Doing so, however, comes at a significant cost in number