1.2.2 Classifier
Any learning algorithm in WEKA is derived from the abstract weka.classifiers.Classifier class. Surprisingly little is needed for a basic classifier: a routine which generates a classifier model from a training dataset (= buildClassifier) and another routine which evaluates the generated model on an unseen test dataset (= classifyInstance), or generates a probability distribution for all classes (= distributionForInstance).
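In Java terms, a minimal subclass might look like the following sketch. It assumes the WEKA 3.x API, where weka.classifiers.Classifier is the abstract superclass (newer releases use weka.classifiers.AbstractClassifier instead); the class name MyClassifier and the placeholder logic are purely illustrative.

    import weka.classifiers.Classifier;
    import weka.core.Instance;
    import weka.core.Instances;

    public class MyClassifier extends Classifier {

      // Learn whatever the model needs from the training data.
      public void buildClassifier(Instances train) throws Exception {
        // ... estimate the model from train ...
      }

      // Predict a single class value: the index of a nominal class
      // value, or the predicted number in the regression case.
      public double classifyInstance(Instance instance) throws Exception {
        return 0.0; // placeholder prediction
      }

      // Alternatively, return a probability for each class; the
      // superclass derives either method from the other, so overriding
      // one of the two is sufficient.
      public double[] distributionForInstance(Instance instance) throws Exception {
        double[] dist = new double[instance.numClasses()];
        dist[0] = 1.0; // placeholder: all mass on the first class
        return dist;
      }
    }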
A classifier model is an arbitrarily complex mapping from all-but-one dataset attributes to the class attribute. The specific form and creation of this mapping, or model, differs from classifier to classifier. For example, ZeroR's (= weka.classifiers.rules.ZeroR) model just consists of a single value: the most common class, or the mean of all numeric values in the case of predicting a numeric value (= regression learning). ZeroR is a trivial classifier, but it gives a lower bound on the performance on a given dataset which should be significantly improved by more complex classifiers. As such it is a reasonable test of how well the class can be predicted without considering the other attributes.
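The same trivial model can also be built programmatically. The following sketch, which assumes weather.arff resides in the current directory, loads the weather data, builds ZeroR on it and prints the resulting single-value model:

    import weka.classifiers.rules.ZeroR;
    import weka.core.Instances;

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class ZeroRDemo {
      public static void main(String[] args) throws Exception {
        // Load the dataset and declare the last attribute as the class.
        Instances data = new Instances(
            new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // Build the trivial model and print it (the most common class).
        ZeroR zeroR = new ZeroR();
        zeroR.buildClassifier(data);
        System.out.println(zeroR);
      }
    }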
Later, we will explain how to interpret the output from classifiers in detail – for now, just focus on the Correctly Classified Instances figure in the Stratified cross-validation section of the output and notice how it improves from ZeroR to J48:
java weka.classifiers.rules.ZeroR -t weather.arff
java weka.classifiers.trees.J48 -t weather.arff
There are various approaches to determining the performance of classifiers. Most simply, performance can be measured as the proportion of correctly predicted examples in an unseen test dataset. This value is the accuracy, which equals 1 - ErrorRate; both terms are used in the literature.
The simplest case is using a training set and a test set which are mutually independent. This is referred to as a hold-out estimate. To estimate the variance in these performance estimates, hold-out estimates may be computed by repeatedly resampling the same dataset, i.e. randomly reordering it and then splitting it into training and test sets with a specific proportion of the examples, collecting all estimates on the test data, and computing the average and standard deviation of the accuracy.
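A repeated hold-out estimate of this kind could be written with the WEKA API roughly as follows; the 2/3 training proportion and the ten repetitions are arbitrary choices for illustration.

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;

    public class HoldOutDemo {
      public static void main(String[] args) throws Exception {
        Instances data = new Instances(
            new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        for (int run = 0; run < 10; run++) {
          // Randomly reorder a copy of the data, then split 2/3 : 1/3.
          Instances copy = new Instances(data);
          copy.randomize(new Random(run));
          int trainSize = (int) (copy.numInstances() * 2.0 / 3.0);
          Instances train = new Instances(copy, 0, trainSize);
          Instances test = new Instances(copy, trainSize,
              copy.numInstances() - trainSize);

          // Train on the first part, evaluate on the held-out rest.
          J48 tree = new J48();
          tree.buildClassifier(train);
          Evaluation eval = new Evaluation(train);
          eval.evaluateModel(tree, test);
          System.out.println("Run " + run + ": "
              + eval.pctCorrect() + "% correct");
        }
      }
    }

Collecting the ten percentages and computing their mean and standard deviation then gives the repeated hold-out estimate described above.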
A more elaborate method is cross-validation. Here, a number of folds n is specified. The dataset is randomly reordered and then split into n folds of equal size. In each iteration, one fold is used for testing and the other n-1 folds are used for training the classifier. The test results are collected and averaged over all folds. This gives the cross-validation estimate of the accuracy. The folds can be purely random or slightly modified so that each fold has the same class distribution as the complete dataset. In the latter case the cross-validation is called stratified. Leave-one-out (= loo) cross-validation signifies that n is equal to the number of examples. Out of necessity, loo cross-validation has to be non-stratified: with only a single example in each test set, the class distribution in the test set cannot match that in the training data. Therefore loo cross-validation tends to give less reliable results. However, it is still quite useful in dealing with small datasets, since it utilizes the greatest amount of training data from the dataset.
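WEKA's Evaluation class implements this procedure directly; for a nominal class it stratifies the folds. A brief sketch, with 10 folds chosen for illustration (setting the fold count to data.numInstances() would give leave-one-out):

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;

    public class CrossValidationDemo {
      public static void main(String[] args) throws Exception {
        Instances data = new Instances(
            new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // Stratified 10-fold cross-validation of J48 on the weather data.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
      }
    }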