Figure 1: DCGAN generator used for LSUN scene modeling. A 100-dimensional uniform distribution Z is projected to a convolutional representation with a small spatial extent and many feature maps. A series of four fractionally-strided convolutions (in some recent papers, these are wrongly called deconvolutions) then converts this high-level representation into a 64 × 64 pixel image. Notably, no fully connected or pooling layers are used.
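For concreteness, the following is a minimal PyTorch sketch of the generator in Figure 1. PyTorch itself, the channel widths, and the BatchNorm/ReLU/Tanh placement are assumptions of this sketch; the figure specifies only the projection of Z to a small spatial volume and the four fractionally-strided convolutions up to 64 × 64.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of a DCGAN-style generator: z (100-d) -> 64x64x3 image.

    Channel widths and the BatchNorm/ReLU/Tanh choices follow common
    DCGAN practice; treat them as illustrative assumptions rather than
    the exact configuration in the figure.
    """
    def __init__(self, z_dim=100, ngf=128):
        super().__init__()
        self.net = nn.Sequential(
            # Project z to a 4x4 spatial volume with many feature maps.
            nn.ConvTranspose2d(z_dim, ngf * 8, 4, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(inplace=True),
            # Four fractionally-strided (transposed) convolutions, each
            # doubling spatial resolution: 4 -> 8 -> 16 -> 32 -> 64.
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ngf * 2, ngf, 4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ngf, 3, 4, stride=2, padding=1, bias=False),
            nn.Tanh(),  # outputs in [-1, 1]; no fully connected or pooling layers
        )

    def forward(self, z):
        # z: (batch, z_dim) -> (batch, z_dim, 1, 1) for the first conv.
        return self.net(z.view(z.size(0), z.size(1), 1, 1))
```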
We found that leaving the Adam momentum term β1 at the suggested value of 0.9 resulted in training oscillation and instability, while reducing it to 0.5 helped stabilize training.
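A minimal sketch of this optimizer setting, assuming PyTorch's Adam (the learning rate and the placeholder networks are illustrative assumptions; only the β1 value comes from the text):

```python
import torch
import torch.nn as nn

# Placeholder networks standing in for the actual generator/discriminator.
generator = nn.Linear(100, 64 * 64 * 3)
discriminator = nn.Linear(64 * 64 * 3, 1)

# Adam with the momentum term beta1 reduced from its suggested default of
# 0.9 to 0.5, as described above. The learning rate is an illustrative
# assumption; only the beta1 value comes from the text.
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
```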
4.1 LSUN
As the visual quality of samples from generative image models has improved, concerns about over-fitting and memorization of training samples have risen. To demonstrate how our model scales with more data and higher-resolution generation, we train a model on the LSUN bedrooms dataset, which contains a little over 3 million training examples. Recent analysis has shown a direct link between how fast models learn and their generalization performance (Hardt et al., 2015). We show samples from one epoch of training (Fig. 2), mimicking online learning, in addition to samples after convergence (Fig. 3), to demonstrate that our model is not producing high-quality samples simply by overfitting/memorizing training examples. No data augmentation was applied to the images.
4.1.1 DEDUPLICATION
To further decrease the likelihood of the generator memorizing input examples (Fig. 2), we perform a simple image de-duplication process. We fit a 3072-128-3072 de-noising, dropout-regularized ReLU autoencoder on 32×32 downsampled center-crops of training examples. The resulting code-layer activations are then binarized by thresholding the ReLU activations, which has been shown to be an effective information-preserving technique (Srivastava et al., 2014) and provides a convenient form of semantic hashing, allowing for linear-time de-duplication. Visual inspection of hash collisions showed high precision, with an estimated false positive rate of less than 1 in 100. Additionally, the technique detected and removed approximately 275,000 near duplicates, suggesting high recall.
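A minimal sketch of this de-duplication pipeline, assuming PyTorch (the corruption rate, the zero threshold, and the omitted training loop are assumptions of the sketch; the 3072-128-3072 shape and ReLU binarization are as described above):

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Sketch of the 3072-128-3072 de-noising autoencoder described above
    (3072 = 32*32*3 flattened pixels). Input dropout provides the
    de-noising regularization; the corruption rate is an assumption."""
    def __init__(self):
        super().__init__()
        self.corrupt = nn.Dropout(p=0.5)  # assumed corruption rate
        self.encode = nn.Sequential(nn.Linear(3072, 128), nn.ReLU())
        self.decode = nn.Linear(128, 3072)

    def forward(self, x):
        code = self.encode(self.corrupt(x))
        return self.decode(code), code

def semantic_hashes(model, images):
    """Binarize the 128-d code layer by thresholding the ReLU activations
    at zero, yielding a hashable 128-bit code per image."""
    model.eval()  # disable dropout so hashes are deterministic
    with torch.no_grad():
        _, codes = model(images)  # images: (N, 3072) float tensor
    return [tuple((c > 0).int().tolist()) for c in codes]

def deduplicate(model, images):
    """Linear-time de-duplication: keep the first image per hash bucket."""
    seen, keep = set(), []
    for i, h in enumerate(semantic_hashes(model, images)):
        if h not in seen:
            seen.add(h)
            keep.append(i)
    return keep  # indices of retained (non-duplicate) images
```

In practice the 128 bits would be packed into a compact key, but a set of 128-tuples already gives the linear-time collision lookup described above.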
4.2 FACES
We scraped images containing human faces from random web image queries of people's names. The names were acquired from DBpedia, with the criterion that the people were born in the modern era. This dataset contains 3M images of 10K people. We run an OpenCV face detector on these images, keeping detections that are of sufficiently high resolution, which gives us approximately 350,000 face boxes. We use these face boxes for training. No data augmentation was applied to the images.