ABOUT THIS BOOK

Top 10 algorithms in data mining

Data and making data-based decisions are so important that even the content of this book was born out of data: it came from a paper presented at the IEEE International Conference on Data Mining, titled "Top 10 Algorithms in Data Mining," which appeared in the Journal of Knowledge and Information Systems in December 2007. The paper was the result of award winners from the KDD conference being asked to identify the top 10 machine learning algorithms. The general outline of this book follows the algorithms identified in that paper. The astute reader will notice that this book has 15 chapters, although there were only 10 "important" algorithms. I will explain, but let's first look at the top 10 algorithms.

The algorithms listed in that paper are C4.5 (trees), k-means, support vector machines, Apriori, Expectation Maximization, PageRank, AdaBoost, k-Nearest Neighbors, Naïve Bayes, and CART. Eight of these ten algorithms appear in this book; the notable exceptions are PageRank and Expectation Maximization. PageRank, the algorithm that launched the search engine giant Google, is not included because I felt it has already been explained and examined in many books; there are entire books dedicated to PageRank. Expectation Maximization (EM) was meant to be in the book, but sadly it is not. The main problem with EM is that it is very heavy on the math, and when I reduced it to a simplified version, like the other algorithms in this book, I felt there was not enough material to warrant a full chapter.

How the book is organized

The book has 15 chapters organized into four parts, plus four appendixes.

Part 1 Machine learning basics

The algorithms in this book do not appear in the same order as in the paper mentioned above. The book starts out with an introductory chapter. The next six chapters in part 1 examine the subject of classification, which is the process of labeling items. Chapter 2 introduces the basic machine learning algorithm: k-Nearest Neighbors. Chapter 3 is the first chapter where we look at decision trees. Chapter 4 discusses using probability distributions for classification and the Naïve Bayes algorithm. Chapter 5 introduces Logistic Regression, which is not in the Top 10 list but introduces the subject of optimization algorithms, which are important. The end of chapter 5 also discusses how to deal with missing values in data. You won't want to miss chapter 6, as it discusses the powerful Support Vector Machines. Finally, we conclude our discussion of classification with chapter 7 by looking at the AdaBoost ensemble method. Chapter 7 includes a section on the classification imbalance problem, which arises when the training examples are not evenly distributed.

Part 2 Forecasting numeric values with regression

This part consists of two chapters that discuss regression, or predicting continuous values. Chapter 8 covers regression, shrinkage methods, and locally weighted linear