to generate high resolution 1024 × 1024 images; we believe Imagen is much simpler, as it does not need to learn a latent prior, yet achieves better results in both MS-COCO FID and human evaluation on DrawBench. GLIDE [41] also uses cascaded diffusion models for text-to-image generation, but we use large frozen pretrained language models, which we found to be instrumental to both image fidelity and image-text alignment. XMC-GAN [81] also uses BERT as a text encoder, but we scale to much larger text encoders and demonstrate their effectiveness. The use of cascaded models is also popular throughout the literature [14, 39] and has been used successfully in diffusion models to generate high resolution images [16, 29].
6 Conclusions, Limitations and Societal Impact
Imagen showcases the effectiveness of frozen large pretrained language models as text encoders for text-to-image generation with diffusion models. Our observation that scaling the size of these language models has significantly more impact than scaling the U-Net size on overall performance encourages future research into even larger language models as text encoders. Furthermore, through Imagen we re-emphasize the importance of classifier-free guidance, and we introduce dynamic thresholding, which allows the use of much higher guidance weights than in previous work. With these novel components, Imagen produces 1024 × 1024 samples with unprecedented photorealism and alignment with text.
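For concreteness, the following is a minimal sketch of these two components at a single sampling step, assuming access to the conditional and unconditional noise predictions and the predicted clean image x̂0 (denoted x0_hat below); the function names and the percentile value are illustrative rather than taken from our implementation.

```python
import numpy as np

def guided_epsilon(eps_cond, eps_uncond, w):
    # Classifier-free guidance: move the prediction away from the
    # unconditional estimate with guidance weight w (w = 1 recovers
    # the purely conditional prediction; larger w strengthens guidance).
    return eps_uncond + w * (eps_cond - eps_uncond)

def dynamic_threshold(x0_hat, p=99.5):
    # Dynamic thresholding: set s to the p-th percentile of |x0_hat|;
    # if s > 1, clip x0_hat to [-s, s] and rescale by s, pulling pixel
    # values back into [-1, 1] without the saturation that static
    # clipping induces at high guidance weights.
    # The percentile p is a hyperparameter; 99.5 is an illustrative value.
    s = max(float(np.percentile(np.abs(x0_hat), p)), 1.0)
    return np.clip(x0_hat, -s, s) / s
```

In this sketch, the thresholded x̂0 would be used in place of the raw prediction when computing the next step of the sampler.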
Our primary aim with Imagen is to advance research on generative methods, using text-to-image
synthesis as a test bed. While end-user applications of generative methods remain largely out of
scope, we recognize the potential downstream applications of this research are varied and may impact
society in complex ways. On the one hand, generative models have a great potential to complement,
extend, and augment human creativity [30]. Text-to-image generation models, in particular, have
the potential to extend image-editing capabilities and lead to the development of new tools for
creative practitioners. On the other hand, generative methods can be leveraged for malicious purposes,
including harassment and misinformation spread [20], and raise many concerns regarding social and cultural exclusion and bias [67, 62, 68]. These considerations inform our decision not to release
code or a public demo. In future work we will explore a framework for responsible externalization
that balances the value of external auditing with the risks of unrestricted open-access.
Another ethical challenge relates to the large-scale data requirements of text-to-image models, which have led researchers to rely heavily on large, mostly uncurated, web-scraped datasets. While this
approach has enabled rapid algorithmic advances in recent years, datasets of this nature have been
critiqued and contested along various ethical dimensions. For example, public and academic discourse
regarding appropriate use of public data has raised concerns regarding data subject awareness and
consent [24, 18, 60, 43]. Dataset audits have revealed these datasets tend to reflect social stereotypes,
oppressive viewpoints, and derogatory, or otherwise harmful, associations to marginalized identity
groups [44, 4]. Training text-to-image models on this data risks reproducing these associations
and causing significant representational harm that would disproportionately impact individuals and
communities already experiencing marginalization, discrimination and exclusion within society. As
such, there are a multitude of data challenges that must be addressed before text-to-image models like
Imagen can be safely integrated into user-facing applications. While we do not directly address these
challenges in this work, an awareness of the limitations of our training data guides our decision not to release Imagen for public use. We strongly caution against the use of text-to-image generation methods
for any user-facing tools without close care and attention to the contents of the training dataset.
Imagen’s training data was drawn from several pre-existing datasets of image and English alt-text pairs.
A subset of this data was filtered to remove noise and undesirable content, such as pornographic imagery and toxic language. However, a recent audit of one of our data sources, LAION-400M [61],
uncovered a wide range of inappropriate content including pornographic imagery, racist slurs, and
harmful social stereotypes [4]. This finding informs our assessment that Imagen is not suitable for
]. This finding informs our assessment that Imagen is not suitable for
public use at this time and also demonstrates the value of rigorous dataset audits and comprehensive
dataset documentation (e.g., [23, 45]) in informing consequent decisions about the model’s appropriate
and safe use. Imagen also relies on text encoders trained on uncurated web-scale data, and thus
inherits the social biases and limitations of large language models [5, 3, 50].
While we leave an in-depth empirical analysis of social and cultural biases encoded by Imagen to
future work, our small-scale internal assessments reveal several limitations that guide our decision
not to release Imagen at this time. First, all generative models, including Imagen, risk dropping modes of the data distribution, which may further compound the social