Pluralistic Image Completion
Chuanxia Zheng Tat-Jen Cham Jianfei Cai
School of Computer Science and Engineering
Nanyang Technological University, Singapore
{chuanxia001,astjcham,asjfcai}@ntu.edu.sg
Figure 1. Example completion results of our method on images of a face, a building, and natural scenery with various masks (missing
regions shown in white). For each group, the masked input image is shown left, followed by sampled results from our model without any
post-processing. The results are diverse and plausible. (Zoom in to see the details.)
Abstract
Most image completion methods produce only one result
for each masked input, although there may be many reason-
able possibilities. In this paper, we present an approach for
pluralistic image completion – the task of generating mul-
tiple and diverse plausible solutions for image completion.
A major challenge faced by learning-based approaches is
that there is usually only one ground truth training instance
per label. As such, sampling from conditional VAEs still leads
to minimal diversity. To overcome this, we propose a novel
and probabilistically principled framework with two parallel
paths. One is a reconstructive path that uses the single
given ground truth to obtain a prior distribution of the missing
parts and rebuilds the original image from this distribution.
The other is a generative path for which the conditional
prior is coupled to the distribution obtained in the
reconstructive path. Both are supported by GANs. We also
introduce a new short+long term attention layer that exploits
distant relations among decoder and encoder features,
improving appearance consistency. When tested on datasets
with buildings (Paris), faces (CelebA-HQ), and natural images
(ImageNet), our method not only generated higher-quality
completion results, but also produced multiple and diverse
plausible outputs.
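As a rough illustration of the two-path idea described above, the following minimal sketch assumes that both paths parameterize diagonal Gaussian distributions over a latent code, and that the coupling between them is measured by a KL divergence. Both are assumptions made for illustration only. The function names, the latent size, and the numeric values are hypothetical stand-ins for encoder outputs and are not taken from the method itself.

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over latent dims."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def sample_latents(mu, logvar, n, rng):
    """Draw n latent codes with the reparameterization z = mu + sigma * eps."""
    eps = rng.standard_normal((n, mu.shape[0]))
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)

# Hypothetical encoder outputs (assumed values, 2-D latent for brevity).
# Reconstructive path: distribution inferred from the hidden ground truth.
mu_rec, logvar_rec = np.array([0.5, -0.2]), np.array([-1.0, -1.0])
# Generative path: conditional prior inferred from the visible region only.
mu_gen, logvar_gen = np.array([0.0, 0.0]), np.array([0.0, 0.0])

# Assumed coupling term: penalize disagreement between the two distributions.
coupling = kl_diag_gaussians(mu_rec, logvar_rec, mu_gen, logvar_gen)

# At test time only the generative path is available; each sampled code
# would be decoded into a different plausible completion.
z_samples = sample_latents(mu_gen, logvar_gen, n=5, rng=rng)
```

Under these assumptions, drawing several codes from the conditional prior is what would yield multiple distinct completions for one masked input, while the coupling term ties that prior to the distribution learned from the ground truth during training.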
1. Introduction
Image completion is a highly subjective process. Sup-
posing you were shown the various images with missing
regions in fig. 1, what would you imagine to be occupying
these holes? Bertalmio et al. [4] related how expert con-
servators would inpaint damaged art by: 1) imagining the
semantic content to be filled based on the overall scene; 2)
ensuring structural continuity between the masked and un-
masked regions; and 3) filling in visually realistic content
for missing regions. Nonetheless, each expert will indepen-
dently end up creating substantially different details, even if
they may universally agree on high-level semantics, such as
general placement of eyes on a damaged portrait.
Based on this observation, our main goal is thus to gen-
erate multiple and diverse plausible results when presented
with a masked image — in this paper we refer to this task
as pluralistic image completion (depicted in fig. 1). This
is as opposed to approaches that attempt to generate only a
single “guess” for missing parts.
Early image completion works [4, 7, 5, 8, 3, 13] focus
only on steps 2 and 3 above, by assuming that gaps
should be filled with content similar to that of the
background. Although these approaches produce high-quality,
texture-consistent images, they can neither capture global
semantics nor hallucinate new content for large holes. More
recently, some learning-based image completion methods
[29, 14, 39, 40, 42, 24, 38] were proposed that infer seman-