ABOUT THIS BOOK
Top 10 algorithms in data mining
Data and making data-based decisions are so important that even the content of this book was born out of data: a paper presented at the IEEE International Conference on Data Mining, titled "Top 10 Algorithms in Data Mining," which appeared in the journal Knowledge and Information Systems in December 2007. The paper was the result of award winners from the KDD conference being asked to identify the top 10 machine learning algorithms. The general outline of this book follows the algorithms identified in that paper. The astute reader will notice that this book has 15 chapters, although the paper named only 10 "important" algorithms. I will explain why, but first let's look at the top 10 algorithms.
The algorithms listed in that paper are C4.5 (trees), k-means, support vector machines, Apriori, Expectation Maximization, PageRank, AdaBoost, k-Nearest Neighbors, Naïve Bayes, and CART. Eight of these ten algorithms appear in this book; the notable exceptions are PageRank and Expectation Maximization. PageRank, the algorithm that launched the search engine giant Google, is not included because I felt it has already been explained and examined in many books; there are entire books dedicated to PageRank. Expectation Maximization (EM) was meant to be in the book, but sadly it is not. The main problem with EM is that it is very heavy on the math, and when I reduced it to a simplified version, as I did with the other algorithms in this book, I felt there was not enough material to warrant a full chapter.
How the book is organized
The book has 15 chapters organized into four parts, plus four appendixes.
Part 1 Machine learning basics
The algorithms in this book do not appear in the same order as in the paper mentioned above. The book starts out with an introductory chapter. The next six chapters in part 1 examine the subject of classification, which is the process of labeling items. Chapter 2 introduces the basic machine learning algorithm: k-Nearest Neighbors. Chapter 3 takes a first look at decision trees. Chapter 4 discusses using probability distributions for classification and the Naïve Bayes algorithm. Chapter 5 introduces Logistic Regression, which is not in the Top 10 list but introduces the subject of optimization algorithms, which are important. The end of chapter 5 also discusses how to deal with missing values in data. You won't want to miss chapter 6, as it discusses the powerful Support Vector Machines. Finally, we conclude our discussion of classification with chapter 7, which looks at the AdaBoost ensemble method. Chapter 7 includes a section on the classification imbalance problem, which arises when the training examples are not evenly distributed.
Part 2 Forecasting numeric values with regression
This part consists of two chapters that discuss regression, or predicting continuous
values. Chapter 8 covers regression, shrinkage methods, and locally weighted linear