quality, diversity and stability. They are mostly applicable to the conditional setting as well, with some modifications. Tab. 2 shows a summary and comparison between some representative methods.
VGAN [684] proposes a GAN for video generation with a spatio-temporal convolutional architecture. TGAN [571] uses two separate generators – a temporal generator that outputs a set of latent variables and an image generator that takes these latent variables to produce the frames of the video – and adopts a WGAN [24] formulation for more stable training. MoCoGAN [666] learns to decompose motion and content by utilizing both video and image discriminators, allowing it to generate videos with the same content and varying motion, and vice versa.
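To make this temporal/image decomposition concrete, the sketch below separates a fixed content code from per-frame motion codes produced by a recurrent temporal generator, in the spirit of TGAN and MoCoGAN. It is a minimal illustration, not the original architectures: the GRU-based temporal generator, the toy 32×32 decoder, and all module sizes are assumptions.

```python
# Minimal sketch (not the original architectures) of a TGAN/MoCoGAN-style
# two-stream generator: a temporal generator produces per-frame motion codes
# from a single video-level code, and a shared image generator renders frames.
import torch
import torch.nn as nn

class TemporalGenerator(nn.Module):
    """Maps one video-level latent z0 to T per-frame motion latents."""
    def __init__(self, z_dim=128, num_frames=16):
        super().__init__()
        self.num_frames = num_frames
        self.rnn = nn.GRUCell(z_dim, z_dim)

    def forward(self, z0):                            # z0: (B, z_dim)
        h = torch.zeros_like(z0)
        motion_codes = []
        for _ in range(self.num_frames):
            h = self.rnn(z0, h)                       # evolve the latent over time
            motion_codes.append(h)
        return torch.stack(motion_codes, dim=1)       # (B, T, z_dim)

class ImageGenerator(nn.Module):
    """Renders one frame from concatenated content and motion codes."""
    def __init__(self, z_dim=128, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(2 * z_dim, ch * 4, 4, 1, 0), nn.ReLU(),
            nn.ConvTranspose2d(ch * 4, ch * 2, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, z_content, z_motion):           # both (B, z_dim)
        z = torch.cat([z_content, z_motion], dim=1)[..., None, None]
        return self.net(z)                            # (B, 3, 32, 32)

def generate_video(batch=4, z_dim=128, num_frames=16):
    z_content = torch.randn(batch, z_dim)             # fixed per video (content)
    z0 = torch.randn(batch, z_dim)                    # seed for the motion path
    motion = TemporalGenerator(z_dim, num_frames)(z0)
    image_g = ImageGenerator(z_dim)
    frames = [image_g(z_content, motion[:, t]) for t in range(num_frames)]
    return torch.stack(frames, dim=1)                 # (B, T, 3, H, W)
```

Keeping the content code fixed while varying the motion codes (or vice versa) is what lets such models reuse the same appearance across different motions.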
DVD-GAN [129] decomposes the discriminator into spatial and temporal components to improve efficiency. MoCoGAN-HD [655] improves upon MoCoGAN by introducing a motion generator in the latent space together with a pre-trained and fixed image generator, which enables the generation of high-quality videos. VideoGPT [750] and Video VQ-VAE [688] both explore using VQ-VAE [677] to generate videos via discrete latent variables. DIGAN [788] designs a dynamics-aware approach that leverages a neural implicit representation-based GAN [613] to model temporal dynamics. StyleGAN-V [614] also proposes a continuous-time video generator based on implicit neural representations, and is built upon StyleGAN2 [321]. Brooks et al. [62] aim to generate consistent and realistic long videos, producing them in two stages: a low-resolution generation stage followed by a super-resolution stage. Video Diffusion [263] explores video generation through a DM with a video-based 3D U-Net [128] architecture. Video LDM [53], PVDM [787] and VideoFusion [423] employ latent diffusion models (LDMs) for video generation, showing a strong ability to generate coherent and long videos at high resolutions. Several recent works [203, 245, 774] also aim to improve the generation of long videos.
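The diffusion-based video generators above all train a denoiser with a standard epsilon-prediction objective, simply applied to video tensors. The sketch below illustrates such a training step with a toy 3D convolutional stand-in for the space-time U-Net; the linear noise schedule, the tiny network, and all shapes are illustrative assumptions rather than any particular paper's configuration.

```python
# Minimal sketch of DDPM-style training on video tensors. The 3D conv stack
# is a placeholder for the far larger video U-Nets used in practice.
import torch
import torch.nn as nn
import torch.nn.functional as F

T_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, T_STEPS)            # assumed linear schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)          # cumulative alpha_bar_t

class TinyVideoDenoiser(nn.Module):
    """Placeholder video denoiser: predicts the noise added to x_t."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, ch, 3, padding=1), nn.SiLU(),
            nn.Conv3d(ch, ch, 3, padding=1), nn.SiLU(),
            nn.Conv3d(ch, 3, 3, padding=1),
        )

    def forward(self, x_t, t):
        # A real model injects the timestep t via embeddings; omitted here.
        return self.net(x_t)

def diffusion_loss(model, x0):
    """x0: clean video batch of shape (B, C, T, H, W), values in [-1, 1]."""
    b = x0.shape[0]
    t = torch.randint(0, T_STEPS, (b,))
    a_bar = alphas_bar[t].view(b, 1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # forward process
    return F.mse_loss(model(x_t, t), noise)                  # epsilon prediction

# Usage: loss = diffusion_loss(TinyVideoDenoiser(), torch.randn(2, 3, 8, 32, 32))
```

Latent diffusion variants such as Video LDM, PVDM and VideoFusion apply the same objective in a compressed latent space rather than on raw pixels, which is what makes high resolutions and longer clips tractable.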
4.2 Conditional Video Generation
Fig. 4. Illustration of various conditional video generation inputs (prefix frames, video clips for video-to-video synthesis, and class information such as "Right Hand Wave"). Examples obtained from [84, 164, 475].
In conditional video generation, users can control the content of the generated video through various inputs. In this subsection, we investigate the scenario where the input information can be a collection of video frames (i.e., including single images), or an input class. A summary of these settings is shown in Fig. 4.
Video-to-Video Synthesis aims to generate a video conditioned on
an input video clip. There are many applications for this task, including
motion transfer and synthetic-to-real synthesis. Most approaches identify
ways to transfer information (e.g., motion information) to the generated
video while maintaining consistency in other aspects (e.g., identities remain
the same).
Vid2vid [697] trains a cGAN on paired input and output videos with a spatio-temporal learning objective to map videos from one visual domain to another. Chan et al. [84] extract the pose sequence from a subject and transfer it to a target subject, which allows users to synthesize people dancing by transferring the motions of a dancer. Wang et al. [696] propose a few-shot approach for video-to-video synthesis which requires only a few videos of a new person to transfer motions between subjects. Mallya et al. [431] aim to maintain 3D consistency in generated videos by condensing and storing the information of the 3D world as a 3D point cloud, improving temporal coherence. LIA [702] approaches the video-to-video task without explicit structure representations, achieving it purely by manipulating the latent space of a GAN [218] for motion transfer.
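As a rough illustration of the paired setting used by vid2vid-style methods, the sketch below trains a generator to map a conditioning video (e.g., pose or semantic maps) to an output video, with a spatio-temporal discriminator judging (condition, video) pairs. This is a minimal sketch under simplifying assumptions: the actual methods add flow-based warping, multi-scale discriminators and temporal consistency terms, and the modules here are hypothetical placeholders.

```python
# Minimal sketch of a paired video-to-video cGAN objective. G and D below are
# toy placeholder modules, not the architectures used by vid2vid.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTranslator(nn.Module):
    """Placeholder generator: condition video -> RGB video (same T, H, W)."""
    def __init__(self, in_ch=3, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, 3, 3, padding=1), nn.Tanh(),
        )
    def forward(self, cond):
        return self.net(cond)

class VideoDiscriminator(nn.Module):
    """Placeholder spatio-temporal discriminator on (condition, video) pairs."""
    def __init__(self, in_ch=6, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv3d(ch, 1, 4, stride=2, padding=1),
        )
    def forward(self, cond, video):
        return self.net(torch.cat([cond, video], dim=1))

def training_step(G, D, cond, real):
    """cond/real: (B, 3, T, H, W) paired condition and target videos."""
    fake = G(cond)
    # Discriminator: real pairs should score 1, generated pairs 0.
    logits_real = D(cond, real)
    logits_fake = D(cond, fake.detach())
    d_loss = (F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
              + F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake)))
    # Generator: fool the discriminator and stay close to the paired target.
    logits_g = D(cond, fake)
    g_loss = (F.binary_cross_entropy_with_logits(logits_g, torch.ones_like(logits_g))
              + 10.0 * F.l1_loss(fake, real))
    return d_loss, g_loss
```

Conditioning the discriminator on the input video (rather than on the output alone) is what enforces that the generated video actually corresponds to the given condition, not merely that it looks realistic.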
Future Video Prediction is the setting where a video generative model takes in some prefix frames (i.e., an image or video clip) and aims to generate the future frames.
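A classic instantiation of this setting, in the spirit of the LSTM-based approach of Srivastava et al. discussed below, is sketched here: an encoder LSTM summarizes the prefix frames and a decoder LSTM unrolls predictions for the future frames. Frames are flattened to vectors for simplicity, and all sizes are illustrative assumptions rather than the original configuration.

```python
# Minimal sketch of encoder-decoder LSTM future frame prediction.
import torch
import torch.nn as nn

class LSTMFuturePredictor(nn.Module):
    def __init__(self, frame_dim=64 * 64, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.decoder = nn.LSTMCell(frame_dim, hidden)
        self.readout = nn.Linear(hidden, frame_dim)

    def forward(self, prefix, num_future):
        """prefix: (B, T_in, frame_dim) flattened input frames."""
        _, (h, c) = self.encoder(prefix)              # summarize the prefix clip
        h, c = h[0], c[0]
        frame = prefix[:, -1]                         # start from the last seen frame
        outputs = []
        for _ in range(num_future):
            h, c = self.decoder(frame, (h, c))        # roll the state forward
            frame = torch.sigmoid(self.readout(h))    # predict the next frame
            outputs.append(frame)
        return torch.stack(outputs, dim=1)            # (B, num_future, frame_dim)

# Usage: preds = LSTMFuturePredictor()(torch.rand(2, 10, 64 * 64), num_future=5)
```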
Srivastava et al. [628] propose to learn video representations with LSTMs in an unsupervised manner via future prediction, which allows the model to generate future frames of a given input video clip. Walker et al. [687] adopt a conditional VAE-based approach for self-supervised learning via future prediction, which can produce multiple predictions for an ambiguous future. Mathieu et al. [442] tackle the issue of blurry predictions by introducing an adversarial training approach, a multi-scale architecture, and an image gradient difference loss function. PredNet [416] designs an architecture where only the deviations of predictions from early network layers are fed to subsequent network layers, which learns better representations for motion prediction. VPN [312] is trained to estimate the joint distribution of the pixel values in a video and models the factorization of the joint likelihood, which makes the computation of a video's likelihood fully tractable. SV2P [31] aims to provide effective stochastic multi-frame predictions for real-world videos via an approach based on stochastic inference. MD-GAN [730] presents a GAN-based approach to generate realistic time-lapse videos given a photo, producing the videos via a multi-stage approach for more realistic modeling. Dorkenwald et al. [164] use a conditional invertible neural network to map between the image and video domains, allowing more control over the video synthesis task.
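Of the ingredients above, the image gradient difference loss of Mathieu et al. [442] is simple enough to state directly: it penalizes mismatches between the spatial gradients of the predicted and ground-truth frames, which discourages blurry averages. The sketch below is a minimal version that averages rather than sums over pixels and assumes alpha = 1.

```python
# Minimal sketch of an image gradient difference loss for frame prediction.
import torch

def gradient_difference_loss(pred, target, alpha=1.0):
    """pred, target: (B, C, H, W) predicted and ground-truth frames."""
    # Absolute vertical and horizontal gradients of each image.
    pred_dy = (pred[:, :, 1:, :] - pred[:, :, :-1, :]).abs()
    pred_dx = (pred[:, :, :, 1:] - pred[:, :, :, :-1]).abs()
    target_dy = (target[:, :, 1:, :] - target[:, :, :-1, :]).abs()
    target_dx = (target[:, :, :, 1:] - target[:, :, :, :-1]).abs()
    # Penalize mismatches between the two gradient fields.
    return ((pred_dy - target_dy).abs() ** alpha).mean() \
         + ((pred_dx - target_dx).abs() ** alpha).mean()

# Usage (shapes only):
# loss = gradient_difference_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
```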
Class-conditional Generation aims to generate videos containing activities according to a given class label. A prominent example is the
action generation task, where action classes are fed into the video generative model.
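A common way to inject the class label, sketched below, is to embed it and concatenate the embedding with the noise vector before generation; the methods discussed next add richer conditioning mechanisms (e.g., pose intermediates or learned keypoints) on top of such basic label injection. The generator here is a hypothetical toy, and all sizes are assumptions.

```python
# Minimal sketch of class conditioning for a toy video generator: the action
# label is embedded and concatenated with the noise vector.
import torch
import torch.nn as nn

class ClassConditionalVideoGenerator(nn.Module):
    def __init__(self, num_classes=10, z_dim=128, num_frames=16, ch=64):
        super().__init__()
        self.num_frames, self.ch = num_frames, ch
        self.embed = nn.Embedding(num_classes, z_dim)
        self.to_feat = nn.Sequential(
            nn.Linear(2 * z_dim, ch * num_frames * 4 * 4), nn.ReLU(),
        )
        self.decode = nn.Sequential(                      # upsample each frame
            nn.ConvTranspose2d(ch, ch, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, z, labels):                         # z: (B, z_dim), labels: (B,)
        h = torch.cat([z, self.embed(labels)], dim=1)     # inject the class label
        h = self.to_feat(h).view(-1, self.ch, 4, 4)       # (B * T, ch, 4, 4)
        frames = self.decode(h)                           # (B * T, 3, 16, 16)
        return frames.view(z.size(0), self.num_frames, 3, 16, 16)

# Usage: video = ClassConditionalVideoGenerator()(torch.randn(2, 128),
#                                                 torch.tensor([3, 7]))
```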
CDNA [183] proposes an unsupervised approach that learns to generate the motion of physical objects via pixel advection, conditioned on the action. PSGAN [755] leverages human pose as an intermediate representation to guide video generation, conditioning on the pose extracted from the input image and an action label. Kim et al. [334] adopt an unsupervised approach that trains a model to detect the keypoints of arbitrary objects, which are then used as pseudo-labels for learning the objects' motions, enabling the generative model to be applied to datasets without costly keypoint annotations in the videos. GameGAN [331] trains a GAN to generate the next video frame of a graphics game based on the key pressed by the agent. ImaGINator [699] introduces a spatio-temporal cGAN architecture that decomposes