fashion. It is done by simulating episodes of K-shot learning (K = 8 in our experiments). In each episode, we randomly draw a training video sequence i and a single frame t from that sequence. In addition to t, we randomly draw K additional frames $s_1, s_2, \dots, s_K$ from the same sequence.
We then compute the estimate $\hat{e}_i$ of the i-th video embedding by simply averaging the embeddings $\hat{e}_i(s_k)$ predicted for these additional frames:
\hat{e}_i = \frac{1}{K} \sum_{k=1}^{K} E\big(x_i(s_k), y_i(s_k); \phi\big) .   (1)
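For concreteness, the episode-level averaging of (1) can be sketched in a few lines of PyTorch; the embedder interface below (a module taking a frame and its landmark image) is our assumption, not a prescribed API:

```python
import torch

def average_embedding(embedder, frames, landmarks):
    # frames, landmarks: tensors of shape (K, C, H, W) from one video.
    # Each (frame, landmark) pair is embedded independently, and the
    # per-frame embeddings e_hat_i(s_k) are averaged as in Eq. (1).
    embeddings = [embedder(x.unsqueeze(0), y.unsqueeze(0))
                  for x, y in zip(frames, landmarks)]
    return torch.cat(embeddings, dim=0).mean(dim=0)  # e_hat_i
```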
A reconstruction $\hat{x}_i(t)$ of the t-th frame, based on the estimated embedding $\hat{e}_i$, is then computed:
\hat{x}_i(t) = G\big(y_i(t), \hat{e}_i; \psi, P\big) .   (2)
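In code, (2) is a single forward pass through the generator. One plausible reading, which we assume in this sketch, is that P is a learned linear projection mapping the embedding to the generator's adaptive parameters:

```python
def reconstruct(generator, P, y_t, e_hat):
    # y_t: landmark image of frame t, shape (1, C, H, W);
    # e_hat: averaged video embedding from Eq. (1).
    # P is modeled here as a torch.nn.Linear layer (an assumption);
    # it projects e_hat to the generator's adaptive parameters.
    adaptive_params = P(e_hat)
    return generator(y_t, adaptive_params)  # x_hat_i(t) of Eq. (2)
```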
The parameters of the embedder and the generator are
then optimized to minimize the following objective that
comprises the content term, the adversarial term, and the
embedding match term:
\mathcal{L}(\phi, \psi, P, \theta, W, w_0, b) = \mathcal{L}_{CNT}(\phi, \psi, P)
  + \mathcal{L}_{ADV}(\phi, \psi, P, \theta, W, w_0, b) + \mathcal{L}_{MCH}(\phi, W) .   (3)
In (3), the content loss term $\mathcal{L}_{CNT}$ measures the distance between the ground truth image $x_i(t)$ and the reconstruction $\hat{x}_i(t)$ using the perceptual similarity measure [19], based on a VGG19 [30] network trained for ILSVRC classification and a VGGFace [27] network trained for face verification. The loss is calculated as the weighted sum of $L_1$ losses between the features of these networks.
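A minimal sketch of such a perceptual content loss, using torchvision's VGG19 as a stand-in; the layer indices and unit weights below are illustrative, since the text does not specify them, and the VGGFace branch would be handled analogously:

```python
import torch
import torchvision.models as models

class ContentLoss(torch.nn.Module):
    # Weighted sum of L1 distances between VGG19 features of the ground
    # truth frame and the reconstruction (the VGG19 half of L_CNT).
    def __init__(self, layer_ids=(1, 6, 11, 20, 29), weights=None):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        self.features = vgg.features.eval()
        for p in self.features.parameters():
            p.requires_grad_(False)
        self.layer_ids = set(layer_ids)  # illustrative: relu1_1..relu5_1
        self.weights = weights or [1.0] * len(layer_ids)

    def forward(self, x_real, x_fake):
        loss, w = 0.0, iter(self.weights)
        h_real, h_fake = x_real, x_fake
        for i, layer in enumerate(self.features):
            h_real, h_fake = layer(h_real), layer(h_fake)
            if i in self.layer_ids:
                loss = loss + next(w) * torch.nn.functional.l1_loss(h_fake, h_real)
        return loss
```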
The adversarial term in (3) corresponds to the realism score computed by the discriminator, which needs to be maximized, and a feature matching term [38], which is essentially a perceptual similarity measure computed using the discriminator (it helps with the stability of the training):
\mathcal{L}_{ADV}(\phi, \psi, P, \theta, W, w_0, b) = -D\big(\hat{x}_i(t), y_i(t), i; \theta, W, w_0, b\big) + \mathcal{L}_{FM} .   (4)
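A sketch of the generator-side term (4), assuming a hypothetical discriminator interface that returns both the scalar realism score of (5) and a list of intermediate features for the feature-matching term [38]:

```python
import torch

def adversarial_loss(discriminator, x_fake, x_real, y_t, idx):
    # Score of the synthesized frame; Eq. (4) maximizes the realism
    # score, hence the negative sign in the minimized objective.
    score_fake, feats_fake = discriminator(x_fake, y_t, idx)
    with torch.no_grad():
        _, feats_real = discriminator(x_real, y_t, idx)
    # Feature matching: L1 distances between discriminator activations.
    loss_fm = sum(torch.nn.functional.l1_loss(f, r)
                  for f, r in zip(feats_fake, feats_real))
    return -score_fake.mean() + loss_fm
```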
Following the projection discriminator idea [32], the columns of the matrix W contain the embeddings that correspond to individual videos. The discriminator first maps its inputs to an N-dimensional vector $V(x_i(t), y_i(t); \theta)$ and then computes the realism score as:
D\big(\hat{x}_i(t), y_i(t), i; \theta, W, w_0, b\big) = V\big(\hat{x}_i(t), y_i(t); \theta\big)^{T} (W_i + w_0) + b ,   (5)
where $W_i$ denotes the i-th column of the matrix W. At the same time, $w_0$ and b do not depend on the video index, so these terms correspond to the general realism of $\hat{x}_i(t)$ and its compatibility with the landmark image $y_i(t)$.
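The projection step of (5) amounts to a dot product between the discriminator's feature vector and a per-video column of W, plus the shared terms; a minimal sketch, with the trunk V assumed to be a separate convolutional network:

```python
import torch

class ProjectionHead(torch.nn.Module):
    # Computes Eq. (5): V(x, y)^T (W_i + w_0) + b.
    def __init__(self, num_videos, dim):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(dim, num_videos))
        self.w0 = torch.nn.Parameter(torch.zeros(dim))
        self.b = torch.nn.Parameter(torch.zeros(1))

    def forward(self, v, idx):
        # v: (B, dim) features from the trunk V; idx: (B,) video indices.
        w_i = self.W[:, idx].t()  # i-th columns of W, one per sample
        return (v * (w_i + self.w0)).sum(dim=1) + self.b
```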
Thus, there are two kinds of video embeddings in our system: the ones computed by the embedder, and the ones that correspond to the columns of the matrix W in the discriminator. The match term $\mathcal{L}_{MCH}(\phi, W)$ in (3) encourages the similarity of the two types of embeddings by penalizing the $L_1$-difference between $\hat{e}_i$ and $W_i$.
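The match term is then a single L1 penalty; a sketch, assuming e_hat comes from Eq. (1) and W is the discriminator's embedding matrix:

```python
import torch

def match_loss(e_hat, W, idx):
    # L_MCH: pull the embedder output towards the idx-th column of W.
    return torch.nn.functional.l1_loss(e_hat, W[:, idx])
```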
As we update the parameters φ of the embedder and the parameters ψ of the generator, we also update the parameters θ, W, $w_0$, b of the discriminator. The update is driven by the minimization of the following hinge loss, which encourages the increase of the realism score on real images $x_i(t)$ and its decrease on synthesized images $\hat{x}_i(t)$:
\mathcal{L}_{DSC}(\phi, \psi, P, \theta, W, w_0, b) = \max\big(0, 1 + D(\hat{x}_i(t), y_i(t), i; \phi, \psi, \theta, W, w_0, b)\big)
  + \max\big(0, 1 - D(x_i(t), y_i(t), i; \theta, W, w_0, b)\big) .   (6)
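The hinge loss of (6) is the standard GAN hinge objective; a minimal sketch, where score_fake and score_real are discriminator outputs of the form (5) on synthesized and real frames:

```python
import torch

def discriminator_loss(score_fake, score_real):
    # Eq. (6): push fake scores below -1 and real scores above +1.
    return (torch.relu(1.0 + score_fake).mean()
            + torch.relu(1.0 - score_real).mean())
```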
The objective (6) thus compares the realism of the fake example $\hat{x}_i(t)$ and the real example $x_i(t)$ and then updates the discriminator parameters to push these scores below −1 and above +1 respectively. The training proceeds by alternating updates of the embedder and the generator that minimize the losses $\mathcal{L}_{CNT}$, $\mathcal{L}_{ADV}$ and $\mathcal{L}_{MCH}$ with the updates of the discriminator that minimize the loss $\mathcal{L}_{DSC}$.
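Putting the pieces together, one meta-learning iteration could alternate the two updates as below; the optimizers, batch layout and module interfaces (including generator.P and discriminator.W) are assumptions carried over from the sketches above:

```python
def meta_learning_step(batch, embedder, generator, discriminator,
                       content_loss, opt_eg, opt_d):
    x_t, y_t, frames, landmarks, idx = batch

    # 1) Embedder/generator update: minimize Eq. (3).
    e_hat = average_embedding(embedder, frames, landmarks)    # Eq. (1)
    x_fake = reconstruct(generator, generator.P, y_t, e_hat)  # Eq. (2)
    loss_eg = (content_loss(x_t, x_fake)                               # L_CNT
               + adversarial_loss(discriminator, x_fake, x_t, y_t, idx)  # L_ADV
               + match_loss(e_hat, discriminator.W, idx))              # L_MCH
    opt_eg.zero_grad(); loss_eg.backward(); opt_eg.step()

    # 2) Discriminator update: minimize the hinge loss of Eq. (6).
    score_fake, _ = discriminator(x_fake.detach(), y_t, idx)
    score_real, _ = discriminator(x_t, y_t, idx)
    loss_d = discriminator_loss(score_fake, score_real)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
```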
3.3. Few-shot learning by fine-tuning
Once the meta-learning has converged, our system can learn to synthesize talking head sequences for a new person, unseen during the meta-learning stage. As before, the synthesis is conditioned on the landmark images. The system is learned in a few-shot way, assuming that T training images x(1), x(2), . . . , x(T) (e.g. T frames of the same video) are given and that y(1), y(2), . . . , y(T) are the corresponding landmark images. Note that the number of frames T need not be equal to K used in the meta-learning stage.
Naturally, we can use the meta-learned embedder to estimate the embedding for the new talking head sequence:

\hat{e}_{NEW} = \frac{1}{T} \sum_{t=1}^{T} E\big(x(t), y(t); \phi\big) ,   (7)
reusing the parameters φ estimated in the meta-learning stage. A straightforward way to generate new frames corresponding to new landmark images is then to apply the generator using the estimated embedding $\hat{e}_{NEW}$ and the meta-learned parameters ψ, as well as the projection matrix P. We have found that the images generated in this way are plausible and realistic; however, there is often a considerable identity gap that is not acceptable for most applications aiming for a high degree of personalization.
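Before fine-tuning, feed-forward synthesis for a new person therefore reduces to Eq. (7) followed by a generator call; a sketch under the same assumed interfaces:

```python
import torch

def few_shot_embedding(embedder, frames, landmarks):
    # Eq. (7): average the embedder outputs over the T available frames
    # (T need not equal the K used during meta-learning).
    with torch.no_grad():
        embeddings = [embedder(x.unsqueeze(0), y.unsqueeze(0))
                      for x, y in zip(frames, landmarks)]
    return torch.cat(embeddings, dim=0).mean(dim=0)  # e_hat_NEW

# Feed-forward use for a new landmark image y_new (no fine-tuning):
#   x_new = reconstruct(generator, generator.P, y_new, e_new)
# Plausible and realistic, but often with a noticeable identity gap.
```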
This identity gap can often be bridged via the fine-tuning
stage. The fine-tuning process can be seen as a simplified
version of meta-learning with a single video sequence and a