smaller reconstruction error at the end of training.⁴
2.2. Stage Two: Learning the Prior
In the second stage, we fix φ and θ, and learn the prior distribution over the text and image tokens by maximizing the ELB with respect to ψ. Here, p_ψ is represented by a 12-billion parameter sparse transformer (Child et al., 2019).
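For context, the ELB referred to here is Equation 1 earlier in the paper; restated in the notation above (up to minor notational differences), it reads

\[
\ln p_{\theta,\psi}(x, y) \;\geq\; \mathbb{E}_{z \sim q_{\varphi}(z \mid x)}\Big( \ln p_{\theta}(x \mid y, z) \;-\; \beta\, D_{\mathrm{KL}}\big(q_{\varphi}(y, z \mid x),\, p_{\psi}(y, z)\big) \Big).
\]

Since φ and θ are frozen in this stage, only the KL term depends on ψ, so maximizing the ELB with respect to ψ amounts to maximizing the expected log-likelihood E[ln p_ψ(y, z)] of the concatenated text and image tokens under the prior.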
Given a text-image pair, we BPE-encode (Sennrich et al., 2015) the lowercased caption using at most 256 tokens⁵ with vocabulary size 16,384, and encode the image using 32 × 32 = 1024 tokens with vocabulary size 8192. The image tokens are obtained using argmax sampling from the dVAE encoder logits, without adding any gumbel noise.⁶ Finally, the text and image tokens are concatenated and modeled autoregressively as a single stream of data.
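As a concrete illustration (not the authors' code), the sketch below shows how a single training example could be turned into the joint token stream; the helper name, shapes, and id layout are illustrative assumptions, since the paper does not spell out these details here.

```python
import numpy as np

TEXT_VOCAB = 16384        # BPE vocabulary size for captions
MAX_TEXT_TOKENS = 256     # captions use at most 256 BPE tokens
IMAGE_VOCAB = 8192        # dVAE codebook size
IMAGE_GRID = 32           # 32 x 32 = 1024 image tokens
# Illustrative id layout: BPE ids, then 256 per-position padding ids
# (see the padding scheme below), then the image codebook.
IMAGE_OFFSET = TEXT_VOCAB + MAX_TEXT_TOKENS

def build_token_stream(text_token_ids, dvae_encoder_logits):
    """Concatenate caption and image tokens into one autoregressive stream.

    `dvae_encoder_logits` is assumed to have shape (32, 32, IMAGE_VOCAB).
    Image tokens are taken by argmax over the encoder logits, with no gumbel noise.
    """
    assert len(text_token_ids) <= MAX_TEXT_TOKENS
    image_token_ids = dvae_encoder_logits.argmax(axis=-1).reshape(-1)   # (1024,)
    return np.concatenate([np.asarray(text_token_ids),
                           image_token_ids + IMAGE_OFFSET])
```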
The transformer is a decoder-only model in which each image token can attend to all text tokens in any one of its 64 self-attention layers. The full architecture is described in Appendix B.1. There are three different kinds of self-attention masks used in the model. The part of the attention masks corresponding to the text-to-text attention is the standard causal mask, and the part for the image-to-image attention uses either a row, column, or convolutional attention mask.⁷
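The sketch below builds one such combined mask for a toy sequence. It is only illustrative: a causal trailing window stands in for the row mask, whereas the actual row, column, and convolutional patterns follow Child et al. (2019) and Appendix B.1.

```python
import numpy as np

def combined_attention_mask(text_len=256, grid=32, row_width=32):
    """Toy combined self-attention mask (True = attention allowed).

    Text-to-text uses the standard causal mask, every image token may attend to
    all text tokens, and image-to-image uses a causal trailing window of
    `row_width` positions as a rough stand-in for a row mask.
    """
    n = text_len + grid * grid
    mask = np.zeros((n, n), dtype=bool)
    # text attends causally to text
    mask[:text_len, :text_len] = np.tril(np.ones((text_len, text_len), dtype=bool))
    # image attends to all text positions
    mask[text_len:, :text_len] = True
    # image attends causally to nearby image tokens (row-style sparsity)
    for i in range(grid * grid):
        q = text_len + i
        start = text_len + max(0, i - row_width + 1)
        mask[q, start:q + 1] = True
    return mask
```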
We limit the length of a text caption to 256 tokens, though it is not totally clear what to do for the “padding” positions in between the last text token and the start-of-image token. One option is to set the logits for these tokens to −∞ in the self-attention operations. Instead, we opt to learn a special padding token separately for each of the 256 text positions. This token is used only when no text token is available. In preliminary experiments on Conceptual Captions (Sharma et al., 2018), we found that this resulted in higher validation loss, but better performance on out-of-distribution captions.
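A minimal sketch of this per-position padding scheme follows; the id range reserved for the 256 learned padding tokens is an assumption made for illustration, not the paper's actual convention.

```python
import numpy as np

PAD_ID_BASE = 16384    # illustrative: first id after the 16,384 BPE ids
MAX_TEXT_TOKENS = 256

def pad_caption(text_token_ids):
    """Fill unused caption positions with a position-specific padding id.

    Position i that carries no BPE token receives id PAD_ID_BASE + i, so the
    model can learn a separate embedding for each of the 256 padding positions.
    """
    ids = list(text_token_ids)[:MAX_TEXT_TOKENS]
    ids += [PAD_ID_BASE + i for i in range(len(ids), MAX_TEXT_TOKENS)]
    return np.asarray(ids)
```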
⁴ This is contrary to the usual tradeoff between the two terms. We speculate that for smaller values of β, the noise from the relaxation causes the optimizer to reduce codebook usage toward the beginning of training, resulting in worse ELB at convergence.
⁵ During training, we apply 10% BPE dropout (Provilkov et al., 2019), whose use is common in the neural machine translation literature.
⁶ Strictly speaking, Equation 1 requires us to sample from the categorical distribution specified by the dVAE encoder logits, rather than taking the argmax. In preliminary experiments on ImageNet, we found that this was a useful regularizer in the overparameterized regime, and allows the transformer to be trained using soft targets for the cross-entropy loss. We decided against this here since the model under consideration is in the underparameterized regime.
⁷ We found that using a single attention operation for all three interactions (“text attends to text”, “image attends to text”, and “image attends to image”) performs better than using separate attention operations that are independently normalized.
Figure 4. Illustration of per-resblock gradient scaling for a transformer resblock. The solid line indicates the sequence of operations for forward propagation, and the dashed line the sequence of operations for backpropagation. We scale the incoming gradient for each resblock by its gradient scale, and unscale the outgoing gradient before it is added to the sum of the gradients from the successive resblocks. The activations and gradients along the identity path are stored in 32-bit precision. The “filter” operation sets all Inf and NaN values in the activation gradient to zero. Without this, a nonfinite event in the current resblock would cause the gradient scales for all preceding resblocks to unnecessarily drop, thereby resulting in underflow.
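A rough PyTorch-style sketch of the scale/filter/unscale pattern described in the caption is given below. It is not the authors' implementation: the resblock body is treated as a black box, and the logic for updating each resblock's gradient scale after a nonfinite event is omitted.

```python
import torch

class _ScaleGrad(torch.autograd.Function):
    """Identity in the forward pass; multiplies the incoming gradient by `scale`."""
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output * ctx.scale, None

class _FilterUnscaleGrad(torch.autograd.Function):
    """Identity forward; in backward, zeroes Inf/NaN ("filter") then divides by `scale`."""
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        g = torch.nan_to_num(grad_output, nan=0.0, posinf=0.0, neginf=0.0)
        return g / ctx.scale, None

def scaled_resblock(x, resblock_fn, grad_scale):
    # The gradient entering the resblock from above is scaled; the gradient
    # leaving it is filtered and unscaled before autograd adds it to the
    # identity-path gradient, which (like the identity activations) would be
    # kept in 32-bit precision.
    h = _FilterUnscaleGrad.apply(x, grad_scale)
    h = resblock_fn(h)
    h = _ScaleGrad.apply(h, grad_scale)
    return x + h
```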
We normalize the cross-entropy losses for the text and image tokens by the total number of each kind in a batch of data. Since we are primarily interested in image modeling, we multiply the cross-entropy loss for the text by 1/8 and the cross-entropy loss for the image by 7/8. The objective is optimized using Adam with exponentially weighted iterate averaging; Appendix B.2 describes the training procedure in more detail. We reserved about 606,000 images for validation, and found no signs of overfitting at convergence.
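As a sketch of this loss weighting (PyTorch-style, with the function name and flattened shapes chosen for illustration):

```python
import torch.nn.functional as F

def prior_loss(text_logits, text_targets, image_logits, image_targets):
    """Mix the per-token-normalized cross-entropies with weights 1/8 and 7/8.

    `*_logits` have shape (num_tokens, vocab) and `*_targets` shape (num_tokens,),
    flattened over the batch; the default reduction='mean' normalizes each loss
    by the number of tokens of that kind in the batch.
    """
    text_ce = F.cross_entropy(text_logits, text_targets)
    image_ce = F.cross_entropy(image_logits, image_targets)
    return text_ce / 8 + 7 * image_ce / 8
```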
2.3. Data Collection
Our preliminary experiments for models up to 1.2 billion parameters were carried out on Conceptual Captions, a dataset of 3.3 million text-image pairs that was developed as an extension to MS-COCO (Lin et al., 2014).
To scale up to 12 billion parameters, we created a dataset of a similar scale to JFT-300M (Sun et al., 2017) by collecting 250 million text-image pairs from the internet. This dataset does not include MS-COCO, but does include Conceptual Captions and a filtered subset of YFCC100M (Thomee et al., 2016). As MS-COCO was created from the latter, our training data includes a fraction of the MS-COCO validation images (but none of the captions). We control for this in the quantitative results presented in Section 3 and find that it has no appreciable bearing on the results. We provide further