fashion. It is done by simulating episodes of K-shot learning (K = 8 in our experiments). In each episode, we randomly draw a training video sequence i and a single frame t from that sequence. In addition to t, we randomly draw K additional frames $s_1, s_2, \dots, s_K$ from the same sequence.
We then compute the estimate $\hat{e}_i$ of the i-th video embedding by simply averaging the embeddings $\hat{e}_i(s_k)$ predicted for these additional frames:
\hat{e}_i = \frac{1}{K} \sum_{k=1}^{K} E\big(x_i(s_k), y_i(s_k); \phi\big) .   (1)
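For concreteness, the episode-level averaging of (1) can be sketched in a few lines of PyTorch; the embedder interface below (a module taking a frame and its landmark image) is our assumption, not a prescribed API:

```python
import torch

def average_embedding(embedder, frames, landmarks):
    # frames, landmarks: tensors of shape (K, C, H, W) from one video.
    # Each (frame, landmark) pair is embedded independently, and the
    # per-frame embeddings e_hat_i(s_k) are averaged as in Eq. (1).
    embeddings = [embedder(x.unsqueeze(0), y.unsqueeze(0))
                  for x, y in zip(frames, landmarks)]
    return torch.cat(embeddings, dim=0).mean(dim=0)  # e_hat_i
```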
A reconstruction $\hat{x}_i(t)$ of the t-th frame, based on the estimated embedding $\hat{e}_i$, is then computed:
\hat{x}_i(t) = G\big(y_i(t), \hat{e}_i; \psi, P\big) .   (2)
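In code, (2) is a single forward pass through the generator. One plausible reading, which we assume in this sketch, is that P is a learned linear projection mapping the embedding to the generator's adaptive parameters:

```python
def reconstruct(generator, P, y_t, e_hat):
    # y_t: landmark image of frame t, shape (1, C, H, W);
    # e_hat: averaged video embedding from Eq. (1).
    # P is modeled here as a torch.nn.Linear layer (an assumption);
    # it projects e_hat to the generator's adaptive parameters.
    adaptive_params = P(e_hat)
    return generator(y_t, adaptive_params)  # x_hat_i(t) of Eq. (2)
```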
The parameters of the embedder and the generator are
then optimized to minimize the following objective that
comprises the content term, the adversarial term, and the
embedding match term:
\mathcal{L}(\phi, \psi, P, \theta, W, w_0, b) = \mathcal{L}_{CNT}(\phi, \psi, P)
  + \mathcal{L}_{ADV}(\phi, \psi, P, \theta, W, w_0, b) + \mathcal{L}_{MCH}(\phi, W) .   (3)
In (3), the content loss term $\mathcal{L}_{CNT}$ measures the distance between the ground truth image $x_i(t)$ and the reconstruction $\hat{x}_i(t)$ using the perceptual similarity measure [19], based on a VGG19 [30] network trained for ILSVRC classification and a VGGFace [27] network trained for face verification. The loss is calculated as the weighted sum of $L_1$ losses between the features of these networks.
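A minimal sketch of such a perceptual content loss, using torchvision's VGG19 as a stand-in; the layer indices and unit weights below are illustrative, since the text does not specify them, and the VGGFace branch would be handled analogously:

```python
import torch
import torchvision.models as models

class ContentLoss(torch.nn.Module):
    # Weighted sum of L1 distances between VGG19 features of the ground
    # truth frame and the reconstruction (the VGG19 half of L_CNT).
    def __init__(self, layer_ids=(1, 6, 11, 20, 29), weights=None):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        self.features = vgg.features.eval()
        for p in self.features.parameters():
            p.requires_grad_(False)
        self.layer_ids = set(layer_ids)  # illustrative: relu1_1..relu5_1
        self.weights = weights or [1.0] * len(layer_ids)

    def forward(self, x_real, x_fake):
        loss, w = 0.0, iter(self.weights)
        h_real, h_fake = x_real, x_fake
        for i, layer in enumerate(self.features):
            h_real, h_fake = layer(h_real), layer(h_fake)
            if i in self.layer_ids:
                loss = loss + next(w) * torch.nn.functional.l1_loss(h_fake, h_real)
        return loss
```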
The adversarial term in (3) corresponds to the realism score computed by the discriminator, which needs to be maximized, and a feature matching term [38], which is essentially a perceptual similarity measure computed using the discriminator (it helps with the stability of the training):
\mathcal{L}_{ADV}(\phi, \psi, P, \theta, W, w_0, b) = -D\big(\hat{x}_i(t), y_i(t), i; \theta, W, w_0, b\big) + \mathcal{L}_{FM} .   (4)
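A sketch of the generator-side term (4), assuming a hypothetical discriminator interface that returns both the scalar realism score of (5) and a list of intermediate features for the feature-matching term [38]:

```python
import torch

def adversarial_loss(discriminator, x_fake, x_real, y_t, idx):
    # Score of the synthesized frame; Eq. (4) maximizes the realism
    # score, hence the negative sign in the minimized objective.
    score_fake, feats_fake = discriminator(x_fake, y_t, idx)
    with torch.no_grad():
        _, feats_real = discriminator(x_real, y_t, idx)
    # Feature matching: L1 distances between discriminator activations.
    loss_fm = sum(torch.nn.functional.l1_loss(f, r)
                  for f, r in zip(feats_fake, feats_real))
    return -score_fake.mean() + loss_fm
```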
Following the projection discriminator idea [32], the columns of the matrix W contain the embeddings that correspond to individual videos. The discriminator first maps its inputs to an N-dimensional vector $V(x_i(t), y_i(t); \theta)$ and then computes the realism score as:
D\big(\hat{x}_i(t), y_i(t), i; \theta, W, w_0, b\big) = V\big(\hat{x}_i(t), y_i(t); \theta\big)^{T} (W_i + w_0) + b ,   (5)
where $W_i$ denotes the i-th column of the matrix W. At the same time, $w_0$ and b do not depend on the video index, so these terms correspond to the general realism of $\hat{x}_i(t)$ and its compatibility with the landmark image $y_i(t)$.
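The projection step of (5) amounts to a dot product between the discriminator's feature vector and a per-video column of W, plus the shared terms; a minimal sketch, with the trunk V assumed to be a separate convolutional network:

```python
import torch

class ProjectionHead(torch.nn.Module):
    # Computes Eq. (5): V(x, y)^T (W_i + w_0) + b.
    def __init__(self, num_videos, dim):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(dim, num_videos))
        self.w0 = torch.nn.Parameter(torch.zeros(dim))
        self.b = torch.nn.Parameter(torch.zeros(1))

    def forward(self, v, idx):
        # v: (B, dim) features from the trunk V; idx: (B,) video indices.
        w_i = self.W[:, idx].t()  # i-th columns of W, one per sample
        return (v * (w_i + self.w0)).sum(dim=1) + self.b
```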
Thus, there are two kinds of video embeddings in our system: the ones computed by the embedder, and the ones that correspond to the columns of the matrix W in the discriminator. The match term $\mathcal{L}_{MCH}(\phi, W)$ in (3) encourages the similarity of the two types of embeddings by penalizing the $L_1$-difference between $\hat{e}_i$ and $W_i$.
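The match term is then a single L1 penalty; a sketch, assuming e_hat comes from Eq. (1) and W is the discriminator's embedding matrix:

```python
import torch

def match_loss(e_hat, W, idx):
    # L_MCH: pull the embedder output towards the idx-th column of W.
    return torch.nn.functional.l1_loss(e_hat, W[:, idx])
```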
As we update the parameters φ of the embedder and the parameters ψ of the generator, we also update the parameters θ, W, $w_0$, b of the discriminator. The update is driven by the minimization of the following hinge loss, which encourages the increase of the realism score on real images $x_i(t)$ and its decrease on synthesized images $\hat{x}_i(t)$:
\mathcal{L}_{DSC}(\phi, \psi, P, \theta, W, w_0, b) = \max\big(0, 1 + D(\hat{x}_i(t), y_i(t), i; \phi, \psi, \theta, W, w_0, b)\big)
  + \max\big(0, 1 - D(x_i(t), y_i(t), i; \theta, W, w_0, b)\big) .   (6)
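The hinge loss of (6) is the standard GAN hinge objective; a minimal sketch, where score_fake and score_real are discriminator outputs of the form (5) on synthesized and real frames:

```python
import torch

def discriminator_loss(score_fake, score_real):
    # Eq. (6): push fake scores below -1 and real scores above +1.
    return (torch.relu(1.0 + score_fake).mean()
            + torch.relu(1.0 - score_real).mean())
```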
The objective (6) thus compares the realism of the fake example $\hat{x}_i(t)$ and the real example $x_i(t)$ and then updates the discriminator parameters to push these scores below −1 and above +1 respectively. The training proceeds by alternating updates of the embedder and the generator that minimize the losses $\mathcal{L}_{CNT}$, $\mathcal{L}_{ADV}$ and $\mathcal{L}_{MCH}$ with the updates of the discriminator that minimize the loss $\mathcal{L}_{DSC}$.
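Putting the pieces together, one meta-learning iteration could alternate the two updates as below; the optimizers, batch layout and module interfaces (including generator.P and discriminator.W) are assumptions carried over from the sketches above:

```python
def meta_learning_step(batch, embedder, generator, discriminator,
                       content_loss, opt_eg, opt_d):
    x_t, y_t, frames, landmarks, idx = batch

    # 1) Embedder/generator update: minimize Eq. (3).
    e_hat = average_embedding(embedder, frames, landmarks)    # Eq. (1)
    x_fake = reconstruct(generator, generator.P, y_t, e_hat)  # Eq. (2)
    loss_eg = (content_loss(x_t, x_fake)                               # L_CNT
               + adversarial_loss(discriminator, x_fake, x_t, y_t, idx)  # L_ADV
               + match_loss(e_hat, discriminator.W, idx))              # L_MCH
    opt_eg.zero_grad(); loss_eg.backward(); opt_eg.step()

    # 2) Discriminator update: minimize the hinge loss of Eq. (6).
    score_fake, _ = discriminator(x_fake.detach(), y_t, idx)
    score_real, _ = discriminator(x_t, y_t, idx)
    loss_d = discriminator_loss(score_fake, score_real)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
```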
3.3. Few-shot learning by fine-tuning
Once the meta-learning has converged, our system can learn to synthesize talking head sequences for a new person, unseen during the meta-learning stage. As before, the synthesis is conditioned on the landmark images. The system is learned in a few-shot way, assuming that T training images x(1), x(2), . . . , x(T) (e.g. T frames of the same video) are given and that y(1), y(2), . . . , y(T) are the corresponding landmark images. Note that the number of frames T need not be equal to K used in the meta-learning stage.
Naturally, we can use the meta-learned embedder to estimate the embedding for the new talking head sequence:

\hat{e}_{NEW} = \frac{1}{T} \sum_{t=1}^{T} E\big(x(t), y(t); \phi\big) ,   (7)
reusing the parameters φ estimated in the meta-learning stage. A straightforward way to generate new frames corresponding to new landmark images is then to apply the generator using the estimated embedding $\hat{e}_{NEW}$ and the meta-learned parameters ψ, as well as the projection matrix P. We have found that the images generated in this way are plausible and realistic; however, there is often a considerable identity gap that is not acceptable for most applications aiming for a high degree of personalization.
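Before fine-tuning, feed-forward synthesis for a new person therefore reduces to Eq. (7) followed by a generator call; a sketch under the same assumed interfaces:

```python
import torch

def few_shot_embedding(embedder, frames, landmarks):
    # Eq. (7): average the embedder outputs over the T available frames
    # (T need not equal the K used during meta-learning).
    with torch.no_grad():
        embeddings = [embedder(x.unsqueeze(0), y.unsqueeze(0))
                      for x, y in zip(frames, landmarks)]
    return torch.cat(embeddings, dim=0).mean(dim=0)  # e_hat_NEW

# Feed-forward use for a new landmark image y_new (no fine-tuning):
#   x_new = reconstruct(generator, generator.P, y_new, e_new)
# Plausible and realistic, but often with a noticeable identity gap.
```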
This identity gap can often be bridged via the fine-tuning
stage. The fine-tuning process can be seen as a simplified
version of meta-learning with a single video sequence and a