of time complexity, memory complexity and the required number of
labelled records.
One-class Support Vector Machines (1SVMs) [8–10] are a
popular technique for unsupervised anomaly detection. Generally,
they aim to model the underlying distribution of normal data
while being insensitive to noise or anomalies in the training
records. A kernel function implicitly maps the input space to a
higher dimensional feature space to make a clearer separation
between normal and anomalous data. When properly applied, in
principle a kernel-based method is able to model any non-linear
pattern of normal behaviour. For clarity in the rest of the paper, the
notation 1SVM is used to denote an (unsupervised) one-class
SVM; lSVM — short for labelled SVM — to denote (supervised)
binary and multi-class SVM classifiers; and SVM when both
1SVMs and lSVMs are considered.
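To make the 1SVM setting concrete, the following minimal sketch (an illustration only, not the configuration evaluated in this paper) trains scikit-learn's OneClassSVM on synthetic "normal" records; the data, kernel, and nu value are all assumed for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# "Normal" training records drawn from a single Gaussian cluster;
# nu upper-bounds the fraction of training points treated as outliers.
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

# The RBF kernel implicitly maps records to a higher-dimensional
# feature space, where normal data is separated from anomalies.
X_test = np.array([[0.1, -0.2],   # near the normal cluster
                   [8.0, 8.0]])   # an obvious anomaly
labels = clf.predict(X_test)      # +1 = normal, -1 = anomalous
print(labels)
```

Note that no labels are used during training; the model is fit on unlabelled records assumed to be mostly normal.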
SVMs are theoretically appealing for the following reasons
[11,12]: they provide good generalisation when the parameters are
appropriately configured, even if the training set has some bias;
they deliver a unique solution, since the loss function is convex;
and in principle they can model any training set, when an
appropriate kernel is chosen.
In practice, however, training SVMs is memory and time
intensive. SVMs are non-parametric learning models, whose
complexity grows quadratically with the number of records [13].
They are best suited to small datasets with many features, and so
far large-scale training on high-dimensional records (e.g.,
10^6 × 10^4) has been limited with SVMs [14]. Large numbers of
input features result in the curse of dimensionality phenomenon,
which causes the generalisation error of shallow architectures
(discussed in Section 2.1), such as SVMs, to increase with the
number of irrelevant and redundant features. The curse of
dimensionality implies that to obtain good generalisation, the
number of training samples must grow exponentially with the
number of features [14,4,15]. Furthermore, shallow architectures
have practical limitations for efficient representation of certain
types of function families [16]. To avoid these major issues, it is
essential to generate a model that can capture the large degree of
variation that occurs in the underlying data pattern, without
having to enumerate all of them. Therefore, a compact representation
of the data that captures most of the variation can
alleviate the curse of dimensionality as well as reduce the
computational complexity of the algorithm [16,17].
An alternative class of classification algorithms that have
emerged in recent years are Deep Belief Nets (DBNs), which have
been proposed as a multi-class classifier and dimensionality
reduction tool [18–20]. DBNs are multi-layer generative models
that learn one layer of features at a time from unlabelled data. The
extracted features are then treated as the input for training the
next layer. This efficient, greedy learning can be followed by fine-
tuning the weights to improve the generative or discriminative
performance of the whole network.
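The greedy layer-by-layer scheme can be sketched with stacked restricted Boltzmann machines, e.g., scikit-learn's BernoulliRBM; the layer sizes and hyperparameters below are illustrative assumptions, not values from this paper.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
X = rng.random((200, 64))  # unlabelled records with features in [0, 1]

# Greedy layer-wise training: each RBM is fit on the features
# produced by the previous layer, using no labels at any stage.
representation = X
for n_hidden in (32, 16):
    rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                       n_iter=10, random_state=0)
    representation = rbm.fit_transform(representation)

print(representation.shape)  # (200, 16): compact deep features
```

In a full DBN these layer weights would subsequently be fine-tuned, generatively or discriminatively, as described above.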
DBNs have a deep architecture, composed of multiple layers of
parameterised non-linear modules. A range of advantageous
properties has been identified for DBNs [16]: they can
learn higher-level features that yield good classification accuracy;
they are parametric models, whose training time scales linearly
with the number of records; they can use unlabelled data to learn
from complex and high-dimensional datasets.
A major limitation of DBNs is that their loss function is non-
convex, therefore the model often converges on local minima and
there is no guarantee that the global minimum will be found. In
addition, DBN classifiers are semi-supervised algorithms that
require some labelled examples for discriminative fine-tuning;
hence unsupervised generative models of DBNs, known as
autoencoders, are used for anomaly detection.
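As one simple illustration of autoencoder-based anomaly detection (a stand-in, not the specific autoencoder evaluated later in the paper), a network trained to reconstruct its own input can flag records with large reconstruction error; the architecture and threshold below are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 10))  # unlabelled normal records

# An autoencoder reproduces its input through a narrow hidden layer;
# here an MLP regressor with target == input serves as a stand-in.
ae = MLPRegressor(hidden_layer_sizes=(4,), max_iter=500,
                  random_state=0).fit(X_train, X_train)

def reconstruction_error(model, X):
    # Mean squared error per record between input and reconstruction.
    return np.mean((model.predict(X) - X) ** 2, axis=1)

# Records reconstructed poorly, relative to a quantile of the
# training errors, are flagged as anomalies.
threshold = np.quantile(reconstruction_error(ae, X_train), 0.95)
X_test = np.vstack([X_train[:1], 10.0 * np.ones((1, 10))])
errors = reconstruction_error(ae, X_test)
print(errors > threshold)
```

The out-of-distribution record (all 10s) incurs a far larger reconstruction error than any training record and is therefore flagged.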
The open research problem we address is how to overcome the
limitations of one-class SVM architectures on complex, high-
dimensional datasets. We propose the use of DBNs as a feature
reduction stage for one-class SVMs, to give a hybrid anomaly
detection architecture. While a variety of feature reduction
methods — i.e., feature selection and feature extraction methods —
have been considered for SVMs (e.g., [21–25] — see [26] for a
survey) none have studied the use of DBNs as a method for deep
feature construction in the context of anomaly detection, i.e., with
a one-class SVM. In this paper, we design and evaluate a new
architecture for anomaly detection in high-dimensional domains.
To the best of our knowledge, this is the first method proposed for
combining DBNs with one-class SVMs to improve their performance
for anomaly detection.
The contributions of this paper are two-fold. The performance
of DBNs is evaluated against that of one-class SVMs for detecting
anomalies in complex, high-dimensional data; in contrast, the
results reported in the literature on DBN classification
performance cover only multi-class classification, e.g., [14,27–29]. A novel
unsupervised anomaly detection model is also proposed, which
combines the advantages of deep belief nets with one-class SVMs.
In our proposed model an unsupervised DBN is trained to extract
features that are reasonably insensitive to irrelevant variations in
the input, and a 1SVM is trained on the feature vectors produced
by the DBN. More specifically, for anomaly detection we show that
computationally expensive non-linear kernel machines can be
replaced by linear ones, when aggregated with a DBN. To the best
of our knowledge, this is the first time these frameworks have
been combined in this way. The results of experiments conducted on
several benchmark datasets demonstrate that our hybrid model
yields significant performance improvements over the stand-alone
systems. The hybrid DBN-1SVM avoids the complexity of
non-linear kernel machines, and reaches the accuracy of a
state-of-the-art autoencoder while considerably lowering
training and testing time.
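The hybrid pipeline can be sketched as two unsupervised stages; for brevity a single RBM stands in for the multi-layer DBN, and all hyperparameters are illustrative assumptions rather than the configuration used in the experiments.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.random((300, 32))  # unlabelled records, features in [0, 1]

# Stage 1: an unsupervised feature extractor (one RBM here, a deep
# stack in the DBN-1SVM architecture) yields a compact representation.
rbm = BernoulliRBM(n_components=8, n_iter=15, random_state=0).fit(X_train)

# Stage 2: a *linear* 1SVM is trained on the extracted features,
# replacing an expensive non-linear kernel in the raw input space.
svm = OneClassSVM(kernel="linear", nu=0.1).fit(rbm.transform(X_train))

labels = svm.predict(rbm.transform(X_train[:5]))
print(labels.shape)  # (5,): +1/-1 anomaly decisions
```

Because the 1SVM operates on the low-dimensional DBN features, its training cost no longer depends on evaluating a non-linear kernel over the original high-dimensional records.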
The remainder of the paper is organised as follows. Section 2
begins with an introduction to deep architectures and their strengths
and weaknesses compared to their shallow counterparts. It then
reviews some of the leading 1SVM methods, and motivates the
requirements for the hybrid model by considering the shortcomings
of SVMs for processing large datasets. Section 3 presents our
proposed unsupervised anomaly detection approach, DBN-1SVM. Section
4 describes the empirical analysis and provides a detailed statistical
comparison of the performance of autoencoder, 1SVM and DBN-
1SVM models on various real-world and synthetic datasets. It
demonstrates the advantages of the DBN-1SVM architecture in terms
of both accuracy and computational efficiency. Section 5 summarises
the paper and outlines future research.
2. Background
2.1. Shallow and deep architectures
Classification techniques with shallow architectures typically
comprise an input layer together with a single layer of processing.
Kernel machines such as SVMs, for example, comprise a layer of kernel
functions applied to the input, followed by a linear combination
of the kernel outputs. In contrast, deep architectures are
composed of several layers of non-linear processing nodes. The
most widely used form of the latter is the multi-layer
neural network with multiple hidden layers.
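The shallow kernel-machine form described above, a single layer of kernel evaluations followed by a linear combination, i.e., f(x) = sum_i alpha_i k(x, x_i) + b, can be written out directly; the toy coefficients and support vectors below are purely illustrative.

```python
import numpy as np

# A shallow kernel machine: one layer of kernel evaluations against
# stored support vectors, then a linear combination of the outputs.
def rbf(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

support_vectors = np.array([[0.0, 0.0], [1.0, 1.0]])
alphas = np.array([0.7, 0.3])  # illustrative coefficients
bias = -0.5

def decision(x):
    # f(x) = sum_i alpha_i * k(x, x_i) + b
    return sum(a * rbf(x, sv) for a, sv in zip(alphas, support_vectors)) + bias

print(round(decision(np.array([0.0, 0.0])), 4))  # 0.3104
```

All of the non-linearity lives in the single kernel layer; the output stage is purely linear, which is what makes the architecture shallow.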
While shallow architectures offer important advantages when
optimising the parameters of the model, such as using convex loss
functions, they suffer from limitations in terms of providing an
efficient representation for certain types of function families. In
Please cite this article as: S.M. Erfani, et al., High-dimensional and large-scale anomaly detection using a linear one-class SVM with
deep learning, Pattern Recognition (2016), http://dx.doi.org/10.1016/j.patcog.2016.03.028i