2 1. ENSEMBLES DISCOVERED
How can we tell, ahead of time, which algorithm will excel for a given problem? Michie et al.
(1994) addressed this question by executing a similar but larger study (23 algorithms on 22 data
sets) and building a decision tree to predict the best algorithm to use given the properties of a data
set¹. Though the study was skewed toward trees (9 of the 23 algorithms were trees, and several of
the academic data sets had unrealistic thresholds amenable to trees), the study did reveal useful
lessons for algorithm selection (as highlighted in Elder, J. (1996a)).
Still, there is a way to improve model accuracy that is easier and more powerful than judicious
algorithm selection: one can gather models into ensembles. Figure 1.2 reveals the out-of-sample
accuracy of the models of Figure 1.1 when they are combined four different ways, including averaging,
voting, and “advisor perceptrons” (Elder and Lee, 1997). While the ensemble technique of
advisor perceptrons beats simple averaging on every problem, the difference is small compared to the
difference between ensembles and the single models. Every ensemble method competes well here
against the best of the individual algorithms.
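As a rough illustration of two of the combination schemes named above, the sketch below averages class probabilities and takes a majority vote for a single case. The three models and their outputs are hypothetical values chosen for illustration, not results from the study, and advisor perceptrons are not shown.

```python
from collections import Counter

def average_ensemble(prob_lists):
    """Average the per-model class-probability vectors for one case."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]

def vote_ensemble(labels):
    """Return the label chosen by the most models (ties go to the first seen)."""
    return Counter(labels).most_common(1)[0][0]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical outputs of three models for one case with two classes.
model_probs = [
    [0.90, 0.10],  # model A is confident in class 0
    [0.40, 0.60],  # model B leans toward class 1
    [0.45, 0.55],  # model C leans toward class 1
]

# Averaging: combine the probabilities first, then pick the top class.
avg = average_ensemble(model_probs)
avg_label = argmax(avg)

# Voting: let each model pick its top class, then count the picks.
model_labels = [argmax(p) for p in model_probs]
vote_label = vote_ensemble(model_labels)

print("averaged probabilities:", avg, "-> class", avg_label)
print("individual votes:", model_labels, "-> class", vote_label)
```

In this made-up case the two schemes disagree: averaging lets the one confident model outweigh the two hesitant ones, while voting counts each model equally.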
This phenomenon was discovered by a handful of researchers, separately and simultaneously,
to improve classification whether using decision trees (Ho, Hull, and Srihari, 1990), neural networks
(Hansen and Salamon, 1990), or math theory (Kleinberg, E., 1990). The most influential
early developments were by Breiman, L. (1996) with Bagging, and Freund and Schapire (1996) with
AdaBoost (both described in Chapter 4).
One of us stumbled across the marvel of ensembling (which we called “model fusion” or
“bundling”) while striving to predict the species of bats from features of their echo-location signals
(Elder, J., 1996b)². We built the best model we could with each of several very different
algorithms, such as decision trees, neural networks, polynomial networks, and nearest neighbors
(see Nisbet et al. (2009) for algorithm descriptions). These methods employ different basis functions
and training procedures, which causes their diverse surface forms – as shown in Figure 1.3 –
and often leads to surprisingly different prediction vectors, even when the aggregate performance is
very similar.
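A toy sketch can make the point about prediction vectors concrete. The labels and predictions below are invented for illustration and deliberately constructed so that the three models never err on the same case; real models' errors usually overlap at least partly, so the gain from combining them would be smaller.

```python
# Hypothetical true labels and predictions from three models on ten cases.
truth   = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
model_a = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]  # wrong on cases 4 and 9
model_b = [1, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # wrong on cases 0 and 5
model_c = [0, 1, 0, 0, 0, 1, 0, 1, 1, 1]  # wrong on cases 1 and 6

def accuracy(pred, actual):
    """Fraction of cases predicted correctly."""
    return sum(p == t for p, t in zip(pred, actual)) / len(actual)

def disagreement(pred_1, pred_2):
    """Fraction of cases on which two prediction vectors differ."""
    return sum(a != b for a, b in zip(pred_1, pred_2)) / len(pred_1)

# Majority vote of the three binary predictions for each case.
majority = [1 if a + b + c >= 2 else 0
            for a, b, c in zip(model_a, model_b, model_c)]

acc_a = accuracy(model_a, truth)
acc_b = accuracy(model_b, truth)
acc_c = accuracy(model_c, truth)
dis_ab = disagreement(model_a, model_b)
acc_vote = accuracy(majority, truth)

print("individual accuracies:", acc_a, acc_b, acc_c)
print("disagreement between A and B:", dis_ab)
print("majority-vote accuracy:", acc_vote)
```

All three models score the same in aggregate, yet any two of them differ on 4 of the 10 cases, and that difference is what a combination can exploit.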
The project goal was to classify a bat’s species noninvasively, by using only its “chirps.” University
of Illinois Urbana-Champaign biologists captured 19 bats, labeled each as one of 6 species, then
recorded 98 signals, from which UIUC engineers calculated 35 time-frequency features³. Figure 1.4
illustrates a two-dimensional projection of the data where each class is represented by a different
color and symbol. The data displays useful clustering but also much class overlap to contend with.
Each bat contributed 3 to 8 signals, and we realized that the set of signals from a given bat had
to be kept together (in either training or evaluation data) to fairly test the model’s ability to predict
the species of an unknown bat. That is, any bat with a signal in the evaluation data must have no other
¹ The researchers (Michie et al., 1994, Section 10.6) examined the results of one algorithm at a time and built a C4.5 decision
tree (Quinlan, J., 1992) to separate those datasets where the algorithm was “applicable” (where it was within a tolerance of
the best algorithm) from those where it was not. They also extracted rules from the tree models and used an expert system to
adjudicate between conflicting rules to maximize net “information score.” The book is online at
http://www.amsta.leeds.ac.uk/~charles/statlog/whole.pdf
² Thanks to collaboration with Doug Jones and his EE students at the University of Illinois, Urbana-Champaign.
³ Features such as low frequency at the 3-decibel level, time position of the signal peak, and amplitude ratio of 1st and 2nd harmonics.