Latent DDPM. Directly feeding high-resolution visual input into the DDPM incurs a prohibitive amount of time during the reverse process. Therefore, the most popular choice in today's visual generation models is to first shrink the image into a smaller latent representation using an autoencoder; the DDPM then only needs to deal with this much smaller latent representation. This technique is termed latent DDPM [107].
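To make the idea concrete, below is a minimal PyTorch sketch of the latent DDPM setup. The autoencoder architecture, layer sizes, and noise schedule are illustrative assumptions rather than the configuration of any particular published model, and the learned denoising network is omitted; the point is only that the forward noising step (and, symmetrically, the reverse process) operates on a 32x32 latent instead of the 256x256 image.

import torch
import torch.nn as nn

# Toy autoencoder: 8x spatial downsampling, e.g. a 256x256x3 image -> 32x32x4 latent.
# Channel counts and depths here are placeholders, not a real model's values.
class ToyAutoencoder(nn.Module):
    def __init__(self, channels=3, latent_channels=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, latent_channels, 4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, channels, 4, stride=2, padding=1),
        )

def q_sample(z0, t, alphas_cumprod):
    # Forward diffusion applied to the latent:
    # z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps
    noise = torch.randn_like(z0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise, noise

ae = ToyAutoencoder()
image = torch.randn(1, 3, 256, 256)             # stand-in for a real high-resolution image
z0 = ae.encoder(image)                          # compact latent of shape (1, 4, 32, 32)
betas = torch.linspace(1e-4, 0.02, 1000)        # linear noise schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
t = torch.randint(0, 1000, (1,))                # random diffusion timestep
z_t, eps = q_sample(z0, t, alphas_cumprod)      # noised latent; a denoiser would be trained to predict eps
reconstruction = ae.decoder(z_t)                # a single decode maps the latent back to image space

Because every reverse-process step touches only the compact latent, the cost of sampling scales with the latent resolution rather than the image resolution, which is precisely the saving the latent DDPM is designed to exploit.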
3 TEXT-GUIDED VIDEO GENERATION
Video generation models stem from image generation models, as video is essentially a sequence of images that follows
a certain temporal consistency rule. In this section, we first briefly introduce how text-to-image generation models evolved into text-to-video generation models. We further discuss the underlying framework behind models of each particular architecture, such as GAN-based, autoregressive-based, and diffusion-based models.
3.1 Text-to-Image
The journey from simple text-to-image generation techniques to the creation of state-of-the-art models capable of producing intricate and realistic images from textual descriptions is fascinating. Initially, text-to-image creation relied on rule-based methods that matched text prompts with visual elements from a predefined database [106]. This further evolved into more complex feature extraction and matching methods that utilized semantic mapping and basic neural networks [158]. Next, the introduction of GAN, particularly conditional GAN (cGAN) [87], significantly improved image realism, as it helps the model focus on specific textual elements to generate more relevant images. GAN's text-to-image generation performance was further enhanced with the introduction of an attention mechanism, as seen in AttnGAN [146]. Soon after, these generation models adopted ViT to benefit from its impressive visual generation quality and scalability; OpenAI's DALL·E [104] is one of the most widely known examples. However, most of today's text-to-image generation models (DALL·E 2, DALL·E 3, and Imagen) utilize the diffusion model for its performance in creating visually detailed and photorealistic output that surpasses GAN [31]. This progression highlights the rapid advancement in AI research and its application in generating visually engaging outputs, with each step marking a significant leap in the technical sophistication of image generation and the nuanced interpretation of verbal descriptions.
3.2 Text-to-Video
Text-to-video generation is a subset of conditional video generation that extends the capability of text-to-image generation [120]. The principal concept of this generation is to produce dynamic and contextually rich videos directly from written descriptions. Initially, this domain relied on simple methods, such as concatenating static images and word-based animations, where algorithms paired text with pre-existing video clips or sequences [105]. However, these early attempts often resulted in limited and sometimes disjointed outputs. A significant leap forward came with the introduction of more advanced machine learning and deep learning techniques.
In the last seven years, text-to-video generation models have been dominated by GAN, autoregressive, and diffusion models. We compiled sixteen representative text-to-video models in Table A1 (Appendix). The evolution of these models is illustrated in Figure 1. GAN-based models were popular from the end of 2017 to the third quarter of 2022. Many of these models also integrate recurrent models such as RNN and LSTM [53] to effectively handle the temporal dynamics. Despite GAN-based models' popularity, the research community started to shift its attention to autoregressive architectures for generative models. The reason for this shift is GAN's limitation in generating video frames with sharp image quality [46, 140]. Moreover, some models are found to be generating monotonous visual