Controllability using Unconditional GANs. Several methods have been proposed for editing unconditional GANs by manipulating the input latent vectors. Some approaches find meaningful latent directions via supervised learning from manual annotations or prior 3D models [Abdal et al. 2021; Leimkühler and Drettakis 2021; Patashnik et al. 2021; Shen et al. 2020; Tewari et al. 2020]. Other approaches compute the important semantic directions in the latent space in an unsupervised manner [Härkönen et al. 2020; Shen and Zhou 2020; Zhu et al. 2023]. Recently, controllability over coarse object position has been achieved by introducing intermediate "blobs" [Epstein et al. 2022] or heatmaps [Wang et al. 2022b]. All of these approaches enable editing of either image-aligned semantic attributes such as appearance, or coarse geometric attributes such as object position and pose. While Editing-in-Style [Collins et al. 2020] showcases some capability for editing spatial attributes, it can only achieve this by transferring local semantics between different samples. In contrast to these methods, our approach allows users to perform fine-grained control over spatial attributes using point-based editing.
GANWarping [Wang et al. 2022a] also uses point-based editing; however, it only enables out-of-distribution image editing. A few warped images can be used to update the generative model such that all generated images demonstrate similar warps. However, this method does not ensure that the warps lead to realistic images. Furthermore, it does not enable controls such as changing the 3D pose of the object. Similar to us, UserControllableLT [Endo 2022] enables point-based editing by transforming the latent vectors of a GAN. However, this approach only supports editing via a single point dragged on the image and does not handle multiple-point constraints well. In addition, the control is not precise, i.e., after editing, the target point is often not reached.
3D-aware GANs. Several methods modify the architecture of the GAN to enable 3D control [Chan et al. 2022, 2021; Chen et al. 2022; Gu et al. 2022; Pan et al. 2021; Schwarz et al. 2020; Tewari et al. 2022; Xu et al. 2022]. Here, the model generates 3D representations that can be rendered using a physically-based analytic renderer. However, unlike our approach, control is limited to global pose or lighting.
Diffusion Models. More recently, diffusion models [Sohl-Dickstein et al. 2015] have enabled image synthesis at high quality [Ho et al. 2020; Song et al. 2020, 2021]. These models iteratively denoise randomly sampled noise to create a photorealistic image. Recent models have shown expressive image synthesis conditioned on text inputs [Ramesh et al. 2022; Rombach et al. 2021; Saharia et al. 2022]. However, natural language does not enable fine-grained control over the spatial attributes of images, and thus, all text-conditional methods are restricted to high-level semantic editing. In addition, current diffusion models are slow since they require multiple denoising steps. While progress has been made toward efficient sampling, GANs are still significantly more efficient.
2.2 Point Tracking
To track points in videos, an obvious approach is optical flow estimation between consecutive frames. Optical flow estimation is a classic problem that estimates motion fields between two images. Conventional approaches solve optimization problems with hand-crafted criteria [Brox and Malik 2010; Sundaram et al. 2010], while deep learning-based approaches have come to dominate the field in recent years due to better performance [Dosovitskiy et al. 2015; Ilg et al. 2017; Teed and Deng 2020]. These deep learning-based approaches typically use synthetic data with ground-truth optical flow to train the deep neural networks. Among them, the most widely used method now is RAFT [Teed and Deng 2020], which estimates optical flow via an iterative algorithm. Recently, Harley et al. [2022] combined this iterative algorithm with a conventional "particle video" approach, giving rise to a new point tracking method named PIPs. PIPs considers information across multiple frames and thus handles long-range tracking better than previous approaches.
In this work, we show that point tracking on GAN-generated images can be performed without any of the aforementioned approaches or additional neural networks. We reveal that the feature spaces of GANs are discriminative enough that tracking can be achieved simply via feature matching. While some previous works also leverage these discriminative features for semantic segmentation [Tritrong et al. 2021; Zhang et al. 2021], we are the first to connect the point-based editing problem to the intuition of discriminative GAN features and to design a concrete method around it. Dispensing with additional tracking models allows our approach to run much more efficiently and thus support interactive editing. Despite the simplicity of our approach, we show that it outperforms state-of-the-art point tracking approaches, including RAFT and PIPs, in our experiments.
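To make this intuition concrete, the sketch below performs nearest-neighbor feature matching over a small search patch around the current handle point. The function name, the patch radius, and the use of an L2 distance are illustrative placeholder choices, not necessarily the exact settings of our method; `feat` is assumed to be an intermediate generator feature map resized to the image resolution.

```python
import torch

def track_point(feat, f0, p, radius=3):
    """Nearest-neighbor feature matching for point tracking (illustrative).

    feat:   (C, H, W) intermediate GAN feature map of the current image,
            assumed already resized to the image resolution.
    f0:     (C,) feature vector of the handle point in the initial image.
    p:      (x, y) current integer position of the handle point.
    radius: half-width of the square search patch around p.
    """
    C, H, W = feat.shape
    x0, y0 = p
    x_lo, x_hi = max(0, x0 - radius), min(W, x0 + radius + 1)
    y_lo, y_hi = max(0, y0 - radius), min(H, y0 + radius + 1)
    patch = feat[:, y_lo:y_hi, x_lo:x_hi]            # (C, h, w) search window
    # Distance of every feature in the patch to the reference feature f0.
    dist = (patch - f0[:, None, None]).norm(dim=0)   # (h, w), L2 for simplicity
    idx = torch.argmin(dist)                         # flattened row-major index
    dy, dx = divmod(idx.item(), dist.shape[1])
    return (x_lo + dx, y_lo + dy)                    # updated handle point
```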
3 METHOD
This work aims to develop an interactive image manipulation method for GANs where users only need to click on the images to define some pairs of (handle point, target point) and drive the handle points to reach their corresponding target points. Our study is based on the StyleGAN2 architecture [Karras et al. 2020]. Here we briefly introduce the basics of this architecture.
StyleGAN Terminology. In the StyleGAN2 architecture, a 512-dimensional latent code $\boldsymbol{z} \in \mathcal{N}(0, \boldsymbol{I})$ is mapped to an intermediate latent code $\boldsymbol{w} \in \mathbb{R}^{512}$ via a mapping network. The space of $\boldsymbol{w}$ is commonly referred to as $\mathcal{W}$. $\boldsymbol{w}$ is then sent to the generator $G$ to produce the output image $\mathbf{I} = G(\boldsymbol{w})$. In this process, $\boldsymbol{w}$ is copied several times and sent to different layers of the generator $G$ to control different levels of attributes. Alternatively, one can also use different $\boldsymbol{w}$ for different layers, in which case the input would be $\boldsymbol{w} \in \mathbb{R}^{l \times 512} = \mathcal{W}^+$, where $l$ is the number of layers. This less constrained $\mathcal{W}^+$ space is shown to be more expressive [Abdal et al. 2019]. As the generator $G$ learns a mapping from a low-dimensional latent space to a much higher-dimensional image space, it can be seen as modelling an image manifold [Zhu et al. 2016].
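The following sketch illustrates how $\boldsymbol{z}$, $\boldsymbol{w}$, $\mathcal{W}$, and $\mathcal{W}^+$ relate, using stand-in modules with a StyleGAN2-like interface; it is not the official StyleGAN2 implementation, and the layer count and mixing index are illustrative.

```python
import torch
import torch.nn as nn

# Stand-in mapping network; the real 8-layer MLP lives in the official
# StyleGAN2 code. This only illustrates the W / W+ distinction.
mapping = nn.Sequential(nn.Linear(512, 512), nn.LeakyReLU(0.2),
                        nn.Linear(512, 512))
num_layers = 18                                   # l; depends on output resolution

z = torch.randn(1, 512)                           # z ~ N(0, I)
w = mapping(z)                                    # w in R^512, i.e., the W space

# W: the same w is copied to all l generator layers.
w_plus = w.unsqueeze(1).repeat(1, num_layers, 1)  # shape (1, l, 512)

# W+: different layers may receive different w vectors (more expressive),
# e.g., replacing w only for the finer layers:
w_plus[:, 6:, :] = mapping(torch.randn(1, 512))
# The generator then produces I = G(w_plus); G itself is omitted here.
```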
3.1 Interactive Point-based Manipulation
An overview of our image manipulation pipeline is shown in Fig. 2. For any image $\mathbf{I} \in \mathbb{R}^{3 \times H \times W}$ generated by a GAN with latent code $\boldsymbol{w}$, we allow the user to input a number of handle points $\{\boldsymbol{p}_i = (x_{p,i}, y_{p,i}) \mid i = 1, 2, ..., n\}$ and their corresponding target points $\{\boldsymbol{t}_i = (x_{t,i}, y_{t,i}) \mid i = 1, 2, ..., n\}$ (i.e., the corresponding target point of $\boldsymbol{p}_i$ is $\boldsymbol{t}_i$). The goal is to move the object in the image such that the handle points reach their corresponding target points.
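As a concrete, hypothetical instance of this input format, two handle-target pairs on a $512 \times 512$ image might be specified as follows; the coordinates are illustrative.

```python
# Hypothetical user input for n = 2 point pairs on a 512x512 image:
# each p_i = (x_{p,i}, y_{p,i}) is dragged toward its paired t_i = (x_{t,i}, y_{t,i}).
handle_points = [(230, 140), (300, 350)]   # {p_i | i = 1, ..., n}
target_points = [(260, 120), (330, 370)]   # {t_i | i = 1, ..., n}
assert len(handle_points) == len(target_points)
```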