
the model predicts the future frame of a possible normal example, which dis-
tinguishes the abnormality during inference. In another study on the same
task, Hasan et al. [18] consider a two-stage approach, first using local features
with a fully connected autoencoder, followed by a fully convolutional autoencoder for
end-to-end feature extraction and classification. Experiments yield competitive
results on anomaly detection benchmarks. To determine the effect of adversar-
ial training on anomaly detection in videos, Dimokranitou [13] uses adversarial
autoencoders, producing comparable performance on the benchmarks.
More recent attention in the literature has focused on adversarial training.
The seminal work of Ravanbakhsh et al. [35] utilizes image-to-image translation
[21] to examine the abnormality detection problem in crowded scenes, achieving
state-of-the-art results on the benchmarks. The approach is to train two conditional
GANs: the first generator produces optical flow from frames, while the second
generates frames from optical flow.
The generalisability of the approach mentioned above is problematic since in
many cases datasets do not have temporal features. One of the most influential
accounts of anomaly detection using adversarial training comes from Schlegl et
al. [39]. The authors hypothesize that the latent vector of the GAN represents
the distribution of the data. However, mapping to the vector space of the GAN
is not straightforward. To achieve this, the authors first train a generator and
discriminator using only normal images. In the next stage, they utilize the pre-
trained generator and discriminator, keeping their weights frozen and remapping
a query image to the latent space by iteratively optimizing the latent vector z.
During inference, the model pinpoints an anomaly via a high anomaly score, and
the authors report a significant improvement over previous work. The main limitation of this work
is its computational complexity since the model employs a two-stage approach,
and remapping the latent vector is extremely expensive. In a follow-up study,
Zenati et al. [40] investigate the use of BiGAN [14] in an anomaly detection task,
examining joint training to map from image space to latent space simultaneously,
and vice-versa. Training the model following the approach of [39] yields superior
results on the MNIST [25] dataset.
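
To make the cost of the two-stage remapping concrete, the following is a minimal PyTorch-style sketch of the iterative latent optimization described in [39], written as our own illustration rather than the authors' implementation; the generator G and discriminator D are assumed to be pre-trained and frozen, and the loss weighting lam, learning rate, and iteration count are placeholder assumptions.

```python
import torch

def remap_to_latent(x, G, D, latent_dim=100, n_iters=500, lr=1e-2, lam=0.1):
    """Remap a query image batch x to the latent space of a pre-trained GAN.

    G and D are assumed frozen: only the latent vector z is updated, which is
    why the procedure must be repeated from scratch for every test image.
    All hyper-parameters here are illustrative assumptions.
    """
    z = torch.randn(x.size(0), latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(n_iters):
        optimizer.zero_grad()
        x_hat = G(z)                                    # image generated from the current z
        residual = torch.abs(x - x_hat).mean()          # image-space reconstruction error
        # Discrimination term, simplified here to D's raw output rather than
        # an intermediate discriminator feature layer.
        discrimination = torch.abs(D(x) - D(x_hat)).mean()
        loss = (1 - lam) * residual + lam * discrimination
        loss.backward()
        optimizer.step()
    # The final loss serves as the anomaly score: high for inputs far from the
    # learned distribution of normal images.
    return z.detach(), loss.item()
```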
Overall, prior work strongly supports the hypothesis that autoencoders and
GANs demonstrate promise in anomaly detection problems [23,39,40].
Motivated by the idea of GANs with inference studied in [39] and [40], we intro-
duce a conditional adversarial network whose generator comprises encoder-
decoder-encoder sub-networks, learning representations in both image and latent
vector space jointly, and achieving state-of-the-art performance both statistically
and computationally.
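
As a preview of this design, the sketch below shows one way a generator with encoder-decoder-encoder sub-networks can be wired up; the layer configuration, kernel sizes, and the assumed 32x32 input resolution are illustrative placeholders rather than the exact GANomaly architecture.

```python
import torch
import torch.nn as nn

class EncoderDecoderEncoder(nn.Module):
    """Generator sketch: encode an image to a latent vector z, decode it back
    to a reconstruction x_hat, then re-encode x_hat to z_hat. Layer sizes and
    the 32x32 input assumption are placeholders, not the paper's settings."""

    def __init__(self, in_channels=3, latent_dim=100):
        super().__init__()

        def encoder():
            return nn.Sequential(
                nn.Conv2d(in_channels, 64, 4, 2, 1), nn.LeakyReLU(0.2),               # 32 -> 16
                nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),  # 16 -> 8
                nn.Conv2d(128, latent_dim, 8, 1, 0),                                  # 8 -> 1 (latent)
            )

        self.encoder1 = encoder()                       # image -> z
        self.decoder = nn.Sequential(                   # z -> x_hat
            nn.ConvTranspose2d(latent_dim, 128, 8, 1, 0), nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, in_channels, 4, 2, 1), nn.Tanh(),
        )
        self.encoder2 = encoder()                       # x_hat -> z_hat

    def forward(self, x):
        z = self.encoder1(x)
        x_hat = self.decoder(z)
        z_hat = self.encoder2(x_hat)
        return x_hat, z, z_hat


# A large distance between z and z_hat at test time suggests an anomalous input.
model = EncoderDecoderEncoder()
x_hat, z, z_hat = model(torch.randn(8, 3, 32, 32))
```

Re-encoding the reconstruction is what allows the two latent representations to be compared, which is the intuition behind learning image and latent-space representations jointly.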
3 Our Approach: GANomaly
To explain our approach in detail, it is essential to briefly introduce the back-
ground of GANs.
Generative Adversarial Networks (GANs) are an unsupervised machine
learning approach initially introduced by Goodfellow et al. [16]. The