2 CHAPTER 1. DATA MINING
and standar d deviation of this Gaussian distribution completely characterize the
distribution and would become the model of the data. 2
1.1.2 Machine Learning
There are some who regard data mining as synonymous with machine learning.
There is no question that s ome data mining appropriately uses algorithms from
machine learning. Machine-learning practitioners use the data as a training s et,
to train an algorithm of one of the many types used by machine-learning prac -
titioners, such a s Bayes nets, support-vector machines, decision trees, hidden
Markov models, and many others.
There are situations where using data in this way makes se ns e. The typical
case where machine learning is a good approach is when we have little idea of
what we are looking for in the data . For exa mple, it is rather unclear what
it is about movies that makes certain movie-goers like or dislike it. Thus,
in answering the “Netflix challenge” to devise an algorithm that predicts the
ratings of movies by users, based on a sample of their responses, machine-
learning algorithms have proved quite successful. We shall discuss a simple
form of this type of algorithm in Section
9.4.
On the other hand, machine learning has not proved successful in situations
where we can describe the goals of the mining more directly. An interesting
case in point is the attempt by WhizBang! Labs
1
to use machine learning to
locate people’s resumes on the Web. It was not able to do better than algorithms
designed by hand to look for some of the obvious words and phrases that appear
in the typical resume. Since everyone who has looked at or written a r esume has
a pretty good idea of what resumes contain, there was no mystery about what
makes a Web page a resume. Thus, there was no advantage to machine-learning
over the direct design of an algorithm to discover resumes.
1.1.3 Computational Approaches to Modeling
More recently, computer scientists have looked at data mining as an algorithmic
problem. In this case, the model of the data is simply the answer to a complex
query about it. For instance, given the set of numbers of Example
1.1, we mig ht
compute their average and standard deviation. No te that these values might
not be the parameters of the Ga ussian that best fits the data, although they
will almost certainly be very close if the size of the data is large.
There are ma ny different approaches to modeling data. We have already
mentioned the poss ibility of constructing a statistical process whereby the data
could have b e e n generated. Most other approaches to modeling can b e describe d
as either
1. Summarizing the data succinctly and approximately, or
1
This startup attempted to use machine learning to mine large-scale data, and hired many
of the top machine-learning people to do so. Unfortunately, it was not able to sur vive.