1.2. RESEARCH CHALLENGES
Learning-based methods require data not only for testing and comparison but also
for training, resulting in even higher data requirements. The data used for training
needs to be representative of the network to which the learning-based method
will be applied, possibly requiring generation of new data for each deployment.
Classification-based methods [40,83] require training data that contains nor-
mal data as well as good representatives of those attacks that should be detected,
to be able to separate attacks from normality. Producing a good coverage of the
very large attack space (including unknown attacks) is not practical for any
network. In addition, the data needs to be labelled, with attacks marked. One
advantage of clustering-based methods [57,84,90,101] is that they require no
labelled training data set containing attacks, significantly reducing the data
requirement. At least two such approaches exist.
When doing unsupervised anomaly detection [57, 90, 101], a model based on
clusters of data is trained using unlabelled data, normal data as well as
attacks. If the underlying assumption holds (i.e. attacks are sparse in the
data), attacks may be detected based on cluster sizes, where small clusters
correspond to attack data.
Unsupervised anomaly detection is a very attractive idea, but unfortunately
experience so far indicates that acceptable accuracy is very hard to obtain.
Also, the assumption of unsupervised anomaly detection is not always fulfilled,
making the approach unsuitable for attacks such as denial of service (DoS) and
scanning, which can make up a large share of the data.
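
The cluster-size idea, and the way it fails when attacks are not sparse, can be
illustrated with a minimal sketch. It is not taken from the cited work: the
single-pass fixed-width clustering, the function names, the toy two-dimensional
points and the parameters width and max_fraction are all assumptions chosen for
illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+

def fixed_width_clusters(points, width):
    """Single pass: a point joins the first cluster whose centre is within
    `width`; otherwise it starts a new cluster (centre = that point)."""
    centres, members = [], []
    for i, p in enumerate(points):
        for c, centre in enumerate(centres):
            if dist(p, centre) <= width:
                members[c].append(i)
                break
        else:
            centres.append(p)
            members.append([i])
    return members

def flag_small_clusters(points, width, max_fraction):
    """Return indices of points in clusters holding at most
    `max_fraction` of all data (hypothetical threshold)."""
    limit = max_fraction * len(points)
    flagged = []
    for idxs in fixed_width_clusters(points, width):
        if len(idxs) <= limit:
            flagged.extend(idxs)
    return sorted(flagged)

# Sparse attacks: eight normal points near the origin, two isolated attacks.
sparse = [(0.1 * i, 0.1 * i) for i in range(8)] + [(5.0, 5.0), (9.0, 1.0)]
flagged_sparse = flag_small_clusters(sparse, width=1.5, max_fraction=0.25)

# Flood: the attack point dominates the data, so the assumption is violated.
flood = [(0.0, 0.0), (0.1, 0.1)] + [(5.0, 5.0)] * 8
flagged_flood = flag_small_clusters(flood, width=1.5, max_fraction=0.25)

print(flagged_sparse)  # indices of the two isolated attacks
print(flagged_flood)   # indices of the two *normal* points
```

In the second data set the flood forms the largest cluster, so the normal
points are flagged and the attack goes undetected.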
In the second approach, which we simply denote (pure) anomaly detection in
this thesis, training data is assumed to consist only of normal data. Munson and
Wimer [84] used a cluster-based model (Watcher) to protect a real web server,
proving anomaly detection based on clustering to be useful in real life. The anom-
aly detection algorithm presented here uses pure anomaly detection to reduce the
training data requirement of classification-based methods and to avoid the attack
volume assumption of unsupervised anomaly detection. By including only normal
data in the detection model, the low accuracy of unsupervised anomaly detection
can be significantly improved upon.
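
As a rough sketch of pure anomaly detection (and not the algorithm presented in
this thesis), the model below is built from data assumed to be normal, and a new
point is flagged when it lies far from every cluster centre. The function names,
the toy points, and the parameters width and threshold are assumptions made for
illustration only.

```python
from math import dist  # Euclidean distance, Python 3.8+

def train_normal_model(normal_points, width):
    """Build cluster centres from data assumed to contain only normal
    traffic: a point becomes a centre if no centre is within `width`."""
    centres = []
    for p in normal_points:
        if all(dist(p, c) > width for c in centres):
            centres.append(p)
    return centres

def is_anomalous(point, centres, threshold):
    """Flag a point lying farther than `threshold` from every centre."""
    return min(dist(point, c) for c in centres) > threshold

# Toy training data: two groups of normal behaviour.
normal = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.0), (3.1, 2.9)]
centres = train_normal_model(normal, width=1.0)

# Detection does not depend on how often an attack occurs:
# every point of a flood is judged on its own distance to the model.
flood = [(8.0, 8.0)] * 100
flood_detected = all(is_anomalous(p, centres, threshold=1.0) for p in flood)
normal_accepted = not is_anomalous((0.3, 0.3), centres, threshold=1.0)
print(centres, flood_detected, normal_accepted)
```

Because each point is compared with the model of normality rather than with
cluster sizes, the volume of the attack plays no role in the decision.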
In a real, live network connected to the Internet, data can never be assumed
to be free of attacks. Pure anomaly detection also works when some attacks
are included in the training data, but those attacks will be considered normal dur-
ing detection and therefore not detected. To increase detection coverage, attacks
should be removed from the training data to as large an extent as possible, with
a trade-off between coverage and data cleaning effort. To reduce costly human
effort, attack data can be filtered out of the training data using up-to-date
misuse detectors, or multiple anomaly detection models may be combined by
voting.
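
One possible reading of the voting idea is sketched below. The text does not
specify the scheme, so everything here is an assumption: the training data is
split into folds, one simple distance-based model is built per fold, and a
training point is dropped when a majority of the models built on the other
folds call it anomalous. All names, parameters and data points are
hypothetical, and the sketch assumes every fold is non-empty.

```python
from math import dist  # Euclidean distance, Python 3.8+

def build_centres(points, width):
    """A point becomes a new centre if no existing centre is within `width`."""
    centres = []
    for p in points:
        if all(dist(p, c) > width for c in centres):
            centres.append(p)
    return centres

def is_anomalous(point, centres, threshold):
    return min(dist(point, c) for c in centres) > threshold

def clean_by_voting(points, folds, width, threshold):
    """Drop points that a majority of models trained on *other* folds flag."""
    models = [build_centres(points[j::folds], width) for j in range(folds)]
    kept = []
    for i, p in enumerate(points):
        voters = [m for j, m in enumerate(models) if j != i % folds]
        votes = sum(is_anomalous(p, m, threshold) for m in voters)
        if 2 * votes <= len(voters):  # no majority against the point
            kept.append(p)
    return kept

# Toy training data: normal points near (0, 0) and (3, 3), one attack at (9, 9).
training = [(0.0, 0.0), (3.0, 3.0), (0.1, 0.2), (3.1, 2.9), (9.0, 9.0),
            (0.2, 0.1), (2.9, 3.1), (0.0, 0.3), (3.2, 3.0)]

naive_model = build_centres(training, width=1.0)
cleaned = clean_by_voting(training, folds=3, width=1.0, threshold=1.0)
clean_model = build_centres(cleaned, width=1.0)

# A repeat of the attack is missed by the contaminated model only.
missed = not is_anomalous((9.1, 9.0), naive_model, threshold=1.0)
caught = is_anomalous((9.1, 9.0), clean_model, threshold=1.0)
print(len(cleaned), missed, caught)
```

The sketch relies on the attack being rare enough to appear in few folds; an
attack present in most folds would survive the vote, which is where misuse
detectors or manual inspection would still be needed.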
An intrusion detection system in a real-time environment needs to be fast