4
•
CHAPTER ONE
methods are accumulating. The potential and importance of DM are becoming widely
recognized. In just the last two years the National Science Foundation has poured millions
of dollars into new research initiatives in this area.
DM methods can be applied to quite different domains, for example to visual data, in
reading handwriting or recognizing faces within digital pictures. DM is also being
used to analyze texts, for example to classify the content of scientific papers or other
documents; hence the term text mining. In addition, DM analytics can be applied to
digitized sound, to recognize words in phone conversations, for example. In this book,
however, we focus on the most common domain: the use of DM methods to analyze
quantitative or numerical data.
Miners look for veins of ore and extract these valuable parts from the surrounding rock.
By analogy, data mining looks for patterns or structure in data. But what does it mean to
say that we look for structure in data? Think of a computer screen that displays thousands
of pixels, points of light or dark. Those points are raw data. But if you scan those pixels by
eye and recognize in them the shapes of letters and words, then you are finding structures
in the data; or, to use another metaphor, you are turning data into information.
The equivalent to the computer screen for numerical data is a spreadsheet or matrix,
where each column represents a single variable and each row contains data for a different
case or person. Each cell within the spreadsheet contains a specific value for one person
on one particular variable.
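
As a minimal sketch of this layout, the short Python example below stores a tiny data matrix as a list of rows. The variable names and values are hypothetical, invented only for illustration; the helper functions `cell` and `column` are likewise assumptions of this sketch rather than part of any particular package.

```python
# Hypothetical toy data: each row is one person, each column one variable.
columns = ["age", "income", "years_school"]
rows = [
    [25, 31000, 12],
    [40, 52000, 16],
    [33, 45000, 14],
    [58, 61000, 18],
]

def cell(row_index, variable):
    """Return the value of one variable for one case (one spreadsheet cell)."""
    return rows[row_index][columns.index(variable)]

def column(variable):
    """Return every case's value on a single variable (one spreadsheet column)."""
    j = columns.index(variable)
    return [row[j] for row in rows]

print(cell(1, "income"))  # the second person's income
print(column("age"))      # everyone's age
```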
How do you recognize patterns or regularities or structures in this kind of raw numerical
data? Statistics provides various ways of expressing the relations between the columns
and rows of data in a spreadsheet. The most familiar one is a correlation matrix.
Instead of repeating the raw data, with its thousands of observations and dozens of variables,
a correlation matrix represents just the relations between each variable and each
other variable. It is a summary, a simplification of the raw data.
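
To make the idea concrete, here is a small sketch that computes a Pearson correlation matrix in plain Python. The data are hypothetical, and the function names `pearson` and `correlation_matrix` are chosen for this illustration only; in practice a statistics package would normally do this work.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(data):
    """Correlate every variable with every other variable.

    `data` maps a variable name to its column of values.
    """
    names = list(data)
    return {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

# Hypothetical toy columns, one value per person.
data = {
    "age": [25, 40, 33, 58],
    "income": [31000, 52000, 45000, 61000],
    "years_school": [12, 16, 14, 18],
}

R = correlation_matrix(data)
for name, row in R.items():
    print(name, [round(r, 3) for r in row.values()])
```

Whatever the number of cases, the result has only one row and one column per variable, which is why it serves as a summary of the raw data.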
Few of us can read a correlation matrix easily, or recognize a meaningful pattern in it,
so we typically go through a second step in looking for structures in numerical data. We
create a model that summarizes the relations in the correlation matrix. An ordinary least
squares (OLS) regression model is one common example. It translates a correlation
matrix into a much smaller regression equation that we can more easily understand and
interpret.
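
The link between correlation and regression can be sketched for the simplest case, a single predictor, where the OLS slope equals the correlation multiplied by the ratio of the two standard deviations. The example below assumes hypothetical data (income in thousands of dollars) and illustrative function names; with several predictors the same idea requires matrix algebra, which is omitted here.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def std(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(x, y):
    """Pearson correlation between x and y."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (std(x) * std(y))

def ols_from_correlation(x, y):
    """One-predictor OLS: slope = r * (s_y / s_x); the line passes through the means."""
    slope = correlation(x, y) * std(y) / std(x)
    intercept = mean(y) - slope * mean(x)
    return intercept, slope

# Hypothetical data: years of schooling and income in thousands of dollars.
years_school = [10, 12, 14, 16, 18]
income = [22, 24, 30, 32, 38]

intercept, slope = ols_from_correlation(years_school, income)
print(f"income = {intercept:.2f} + {slope:.2f} * years_school")
```

The five pairs of observations are reduced to two numbers, an intercept and a slope, which is the sense in which the regression equation is smaller and easier to interpret than the data it summarizes.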
A statistical model is more than just a summary derived from raw data, though. It
is also a tool for prediction, and it is this second property that makes DM especially useful.
Banks accumulate huge databases about customers, including records of who
defaulted on loans. If bank analysts can turn those data into a model to accurately predict
who will default on a loan, then they can reject the riskiest new loan applications
and avoid losses. If Amazon.com can accurately assess your tastes in books, based on
your previous purchases and your similarity to other customers, and then tempt you
with a well-chosen book recommendation, then the company will make more profit. If a