Latent DDPM. Directly feeding high-resolution visual input into the DDPM incurs a prohibitive amount of time during the reverse process. Therefore, the most popular choice in today's visual generation models is to first shrink the image into a smaller latent representation using an autoencoder; the DDPM then only needs to deal with this much smaller latent representation. This technique is termed latent DDPM [107].
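To make the idea concrete, below is a minimal PyTorch sketch of the latent DDPM setup. The autoencoder architecture, layer sizes, and noise schedule are illustrative assumptions rather than the configuration of any particular published model, and the learned denoising network is omitted; the point is only that the forward noising step (and, symmetrically, the reverse process) operates on a 32x32 latent instead of the 256x256 image.

import torch
import torch.nn as nn

# Toy autoencoder: 8x spatial downsampling, e.g. a 256x256x3 image -> 32x32x4 latent.
# Channel counts and depths here are placeholders, not a real model's values.
class ToyAutoencoder(nn.Module):
    def __init__(self, channels=3, latent_channels=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, latent_channels, 4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, channels, 4, stride=2, padding=1),
        )

def q_sample(z0, t, alphas_cumprod):
    # Forward diffusion applied to the latent:
    # z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps
    noise = torch.randn_like(z0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise, noise

ae = ToyAutoencoder()
image = torch.randn(1, 3, 256, 256)             # stand-in for a real high-resolution image
z0 = ae.encoder(image)                          # compact latent of shape (1, 4, 32, 32)
betas = torch.linspace(1e-4, 0.02, 1000)        # linear noise schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
t = torch.randint(0, 1000, (1,))                # random diffusion timestep
z_t, eps = q_sample(z0, t, alphas_cumprod)      # noised latent; a denoiser would be trained to predict eps
reconstruction = ae.decoder(z_t)                # a single decode maps the latent back to image space

Because every reverse-process step touches only the compact latent, the cost of sampling scales with the latent resolution rather than the image resolution, which is precisely the saving the latent DDPM is designed to exploit.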
3 TEXT-GUIDED VIDEO GENERATION
Video generation models stem from image generation models, as video is essentially a sequence of images that follows
a certain temporal consistency rule. In this section, we first briefly introduce how text-to-image generation models evolved into text-to-video generation models. We further discuss the underlying framework behind models of each particular architecture, such as GAN-based, autoregressive-based, and diffusion-based models.
3.1 Text-to-Image
The journey from simple text-to-image generation techniques to the creation of state-of-the-art models capable of producing intricate and realistic images from textual descriptions is fascinating. Initially, text-to-image creation relied on rule-based methods that matched text prompts with visual elements from a predefined database [106]. This further evolved into more complex feature extraction and matching methods that utilized semantic mapping and basic neural networks [158]. Next, the introduction of GAN, particularly conditional GAN (cGAN) [87], significantly improved image realism, as it helps the model focus on specific textual elements to generate more relevant images. GAN's text-to-image generation performance was further enhanced with the introduction of an attention mechanism, as seen in AttnGAN [146]. Soon after, these generation models adopted ViT to benefit from its impressive visual generation quality and scalability; OpenAI's DALL·E [104] is one of the most widely known examples. However, most of today's text-to-image generation models (DALL·E 2, DALL·E 3, and Imagen) utilize the diffusion model for its performance in creating visually detailed and photorealistic output that surpasses GAN [31]. This progression highlights the rapid advancement in AI research and its application in generating visually engaging outputs, with each step marking a significant leap in the technical sophistication of image generation and the nuanced interpretation of verbal descriptions.
3.2 Text-to-Video
Text-to-video generation is a subset of conditional video generation that extends the capability of text-to-image generation [120]. The principal concept of this generation is to produce dynamic and contextually rich videos directly from written descriptions. Initially, this domain relied on simple methods, such as concatenating static images and word-based animations, where algorithms paired text with pre-existing video clips or sequences [105]. However, these early attempts often resulted in limited and sometimes disjointed outputs. A significant leap forward came with the introduction of more advanced machine learning and deep learning techniques.
In the last seven years, text-to-video generation models have been dominated by GAN, autoregressive, and diffusion models. We compiled sixteen representative text-to-video models in Table A1 (Appendix). The evolution of these models is illustrated in Figure 1. GAN-based models were popular from the end of 2017 to the third quarter of 2022. Many of these models also integrate recurrent models such as RNN and LSTM [53] to effectively handle the temporal dynamics. Despite GAN-based models' popularity, the research community started to shift its attention to autoregressive architectures for generative models. The reason for this shift is GAN's limitation in generating video frames with sharp image quality [46, 140]. Moreover, some models are found to be generating monotonous visual