A Classic Introduction to Data Mining: Principles and Techniques Explained
Principles of Data Mining is a classic work co-authored by David Hand, Heikki Mannila, and Padhraic Smyth. Used as a reference text at MIT, it was published by The MIT Press in 2001 and runs to 546 pages. The book examines the mathematical and scientific foundations of extracting useful information from large databases, making it a valuable introductory guide for anyone entering the field of data mining.
The book is organized into chapters covering the main aspects of data mining:
1. Introduction: outlines the concept and goals of data mining and helps the reader appreciate its importance in practical applications.
2. Measurement and Data: focuses on assessing data quality and on preprocessing, including data cleaning and the handling of missing values, laying a solid foundation for subsequent analysis.
3. Data Visualization and Exploration: presents data characteristics through charts and visualization tools, making complex data structures intuitive and helping to reveal potential patterns.
4. Data Analysis and Uncertainty: explains how to handle uncertainty, such as noise and outliers, and how probability and statistical methods are used in data mining.
5. An Overview of Data Mining Algorithms: systematically introduces the main families of data mining techniques, such as classification, clustering, and association rule mining.
6. Models and Patterns: discusses how various models are built and interpreted, and how meaningful patterns are distilled from data.
7. Score Functions: explains the measures used to evaluate data mining results, such as accuracy and recall, so that algorithm performance can be optimized.
8. Search and Optimization Methods: covers search strategies and optimization techniques for finding optimal solutions efficiently in large-scale data.
9. Descriptive Modeling: focuses on descriptive analysis, that is, understanding and summarizing the data at hand.
10. Predictive Modeling (Classification): builds classification models from historical data to predict future events or categories.
11. Predictive Modeling (Regression): addresses the prediction of continuous variables, such as trends in numerical data.
12. Data Organization and Databases: introduces database design and management, emphasizing the effect of data storage on mining performance.
13. Finding Patterns and Rules: explains methods for frequent pattern mining and association rule discovery, revealing latent relationships in the data.
14. Retrieval by Content: discusses how content-based information retrieval can be achieved through text mining techniques.
An appendix examines the concept of random variables in depth, supplementing the probability and statistical theory behind data mining. The book also includes references, an index, and lists of figures, tables, and examples to help readers master the material.
Principles of Data Mining is suitable not only for professionals but also for readers with a general interest in data science. Working through it, readers can build a solid theoretical foundation in data mining and learn to apply these techniques to real problems.
Several score functions are widely used for this purpose; these include likelihood, sum of
squared errors, and misclassification rate (the latter is used in supervised classification
problems). For example, the well-known squared error score function is defined as

$$\sum_{i=1}^{n} \left( y(i) - \hat{y}(i) \right)^2 \qquad (1.1)$$

where we are predicting n "target" values y(i), 1 ≤ i ≤ n, and our predictions for each are
denoted as ŷ(i) (typically this is a function of some other "input" variable values for
prediction and the parameters of the model).
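As a minimal sketch of how such a score function might be computed (the data values and function name below are purely illustrative, not from the book):

```python
import numpy as np

def squared_error_score(y_true, y_pred):
    """Squared error score of equation (1.1): the sum of squared
    differences between the targets y(i) and the predictions y-hat(i)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sum((y_true - y_pred) ** 2)

# Three target values and a model's predictions for them.
print(squared_error_score([1.0, 2.0, 3.0], [1.1, 1.9, 3.5]))  # approx. 0.27
```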
Any views we may have on the theoretical appropriateness of different criteria must be
moderated by the practicality of applying them. The model that we consider to be most
likely to have given rise to the data may be the ideal one, but if estimating its parameters
will take months of computer time it is of little value. Likewise, a score function that is
very susceptible to slight changes in the data may not be very useful (its utility will
depend on the objectives of the study). For example, if altering the values of a few
extreme cases leads to a dramatic change in the estimates of some model parameters,
caution is warranted; a data set is usually chosen from a number of possible data sets,
and it may be that in other data sets the value of these extreme cases would have
differed. Problems like this can be avoided by using robust methods that are less
sensitive to these extreme points.
1.5.2 Optimization and Search Methods
The score function is a measure of how well aspects of the data match proposed models
or patterns. Usually, these models or patterns are described in terms of a structure,
sometimes with unknown parameter values. The goal of optimization and search is to
determine the structure and the parameter values that achieve a minimum (or maximum,
depending on the context) value of the score function. The task of finding the "best"
values of parameters in models is typically cast as an optimization (or estimation)
problem. The task of finding interesting patterns (such as rules) from a large family of
potential patterns is typically cast as a combinatorial search problem, and is often
accomplished using heuristic search techniques. In linear regression, a prediction rule is
usually found by minimizing a least squares score function (the sum of squared errors
between the prediction from a model and the observed values of the predicted variable).
Such a score function is amenable to mathematical manipulation, and the model that
minimizes it can be found algebraically. In contrast, a score function such as
misclassification rate in supervised classification is difficult to minimize analytically. For
example, since it is intrinsically discontinuous, the powerful tool of differential calculus
cannot be brought to bear.
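To make the contrast concrete, here is a minimal sketch (with made-up data) of the algebraic route for least squares: the score function is minimized in closed form by solving the normal equations, with no iterative search. No comparable shortcut exists for a discontinuous score such as misclassification rate.

```python
import numpy as np

# Made-up data: 100 cases, an intercept column plus two input variables.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=100)

# Least squares: the sum of squared errors is minimized algebraically by
# solving the normal equations X'X b = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to the generating coefficients [1.0, 2.0, -0.5]
```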
Of course, while we can produce score functions to produce a good match between a
model or pattern and the data, in many cases this is not really the objective. As noted
above, we are often aiming to generalize to new data which might arise (new customers,
new chemicals, etc.) and having too close a match to the data in the database may
prevent one from predicting new cases accurately. We discuss this point later in the
chapter.
1.5.3 Data Management Strategies
The final component in any data mining algorithm is the data management strategy: the
ways in which the data are stored, indexed, and accessed. Most well-known data
analysis algorithms in statistics and machine learning have been developed under the
assumption that all individual data points can be accessed quickly and efficiently in
random-access memory (RAM). While main memory technology has improved rapidly,
there have been equally rapid improvements in secondary (disk) and tertiary (tape)
storage technologies, to the extent that many massive data sets still reside largely on
disk or tape and will not fit in available RAM. Thus, there will probably be a price to pay
for accessing massive data sets, since not all data points can be simultaneously close to
the main processor.
Many data analysis algorithms have been developed without including any explicit
specification of a data management strategy. While this has worked in the past on
relatively small data sets, many algorithms (such as classification and regression tree
algorithms) scale very poorly when the "traditional version" is applied directly to data that
reside mainly in secondary storage.
The field of databases is concerned with the development of indexing methods, data
structures, and query algorithms for efficient and reliable data retrieval. Many of these
techniques have been developed to support relatively simple counting (aggregating)
operations on large data sets for reporting purposes. However, in recent years,
development has begun on techniques that support the "primitive" data access
operations necessary to implement efficient versions of data mining algorithms (for
example, tree-structured indexing systems used to retrieve the neighbors of a point in
multiple dimensions).
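As a rough illustration of the kind of tree-structured index mentioned above, the sketch below builds a k-d tree over synthetic points and retrieves nearest neighbors without a full scan; SciPy's cKDTree is just one convenient in-memory implementation, not the specific systems the authors have in mind.

```python
import numpy as np
from scipy.spatial import cKDTree

# Synthetic data set: one million points in three dimensions.
rng = np.random.default_rng(0)
points = rng.uniform(size=(1_000_000, 3))

# Build the index once; subsequent neighbor queries avoid scanning every point.
tree = cKDTree(points)

# Retrieve the five nearest neighbors of a query point.
distances, indices = tree.query(np.array([0.5, 0.5, 0.5]), k=5)
print(indices, distances)
```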
1.6 The Interacting Roles of Statistics and Data Mining
Statistical techniques alone may not be sufficient to address some of the more
challenging issues in data mining, especially those arising from massive data sets.
Nonetheless, statistics plays a very important role in data mining: it is a necessary
component in any data mining enterprise. In this section we discuss some of the
interplay between traditional statistics and data mining.
With large data sets (and particularly with very large data sets) we may simply not know
even straightforward facts about the data. Simple eye-balling of the data is not an option.
This means that sophisticated search and examination methods may be required to
illuminate features which would be readily apparent in small data sets. Moreover, as we
commented above, often the object of data mining is to make some inferences beyond
the available database. For example, in a database of astronomical objects, we may
want to make a statement that "all objects like this one behave thus," perhaps with an
attached qualifying probability. Likewise, we may determine that particular regions of a
country exhibit certain patterns of telephone calls. Again, it is probably not the calls in the
database about which we want to make a statement. Rather it will probably be the
pattern of future calls which we want to be able to predict. The database provides the set
of objects which will be used to construct the model or search for a pattern, but the
ultimate objective will not generally be to describe those data. In most cases the
objective is to describe the general process by which the data arose, and other data sets
which could have arisen by the same process. All of this means that it is necessary to
avoid models or patterns which match the available database too closely: given that the
available data set is merely one set from the sets of data which could have arisen, one
does not want to model its idiosyncrasies too closely. Put another way, it is necessary to
avoid overfitting the given data set; instead one wants to find models or patterns which
generalize well to potential future data. In selecting a score function for model or pattern
selection we need to take account of this. We will discuss these issues in more detail in
chapter 7 and chapters 9 through 11. While we have described them in a data mining
context, they are fundamental to statistics; indeed, some would take them as the defining
characteristic of statistics as a discipline.
Since statistical ideas and methods are so fundamental to data mining, it is legitimate to
ask whether there are really any differences between the two enterprises. Is data mining
merely exploratory statistics, albeit for potentially huge data sets, or is there more to data
mining than exploratory data analysis? The answer is yes—there is more to data mining.
The most fundamental difference between classical statistical applications and data
mining is the size of the data set. To a conventional statistician, a "large" data set may
contain a few hundred or a thousand data points. To someone concerned with data
mining, however, many millions or even billions of data points is not unexpected—
gigabyte and even terabyte databases are by no means uncommon. Such large
databases occur in all walks of life. For instance the American retailer Wal-Mart makes
over 20 million transactions daily (Babcock, 1994), and constructed an 11 terabyte
database of customer transactions in 1998 (Piatetsky-Shapiro, 1999). AT&T has 100
million customers and carries on the order of 300 million calls a day on its long distance
network. Characteristics of each call are used to update a database of models for every
telephone number in the United States (Cortes and Pregibon, 1998). Harrison (1993)
reports that Mobil Oil aims to store over 100 terabytes of data on oil exploration. Fayyad,
Djorgovski, and Weir (1996) describe the Digital Palomar Observatory Sky Survey as
involving three terabytes of data. The ongoing Sloan Digital Sky Survey will create a raw
observational data set of 40 terabytes, eventually to be reduced to a mere 400 gigabyte
catalog containing 3 × 10^8 individual sky objects (Szalay et al., 1999). The NASA Earth
Observing System is projected to generate multiple gigabytes of raw data per hour
(Fayyad, Piatetsky-Shapiro, and Smyth, 1996). And the human genome project to
complete sequencing of the entire human genome will likely generate a data set of more
than 3.3 × 10^9 nucleotides in the process (Salzberg, 1999). With data sets of this size
come problems beyond those traditionally considered by statisticians.
Massive data sets can be tackled by sampling (if the aim is modeling, but not necessarily
if the aim is pattern detection) or by adaptive methods, or by summarizing the records in
terms of sufficient statistics. For example, in standard least squares regression
problems, we can replace the large numbers of scores on each variable by their sums,
sums of squared values, and sums of products, summed over the records—these are
sufficient for regression coefficients to be calculated no matter how many records there
are. It is also important to take account of the ways in which algorithms scale, in terms of
computation time, as the number of records or variables increases. For example,
exhaustive search through all subsets of variables to find the "best" subset (according to
some score function), will be feasible only up to a point. With p variables there are 2^p - 1
possible subsets of variables to consider. Efficient search methods, mentioned in the
previous section, are crucial in pushing back the boundaries here.
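The sufficient-statistics idea can be sketched for simple one-variable least squares regression; the streaming records and generating model below are made up for illustration.

```python
import numpy as np

def streaming_regression(records):
    """Fit y = a + b*x from a single pass over the records by accumulating
    the sufficient statistics n, sum(x), sum(y), sum(x^2), sum(x*y)."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in records:               # any iterable, e.g. a scan over a file
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

# Synthetic records generated from y = 3 + 2x plus a little noise.
rng = np.random.default_rng(0)
xs = rng.normal(size=10_000)
ys = 3.0 + 2.0 * xs + rng.normal(scale=0.1, size=10_000)
print(streaming_regression(zip(xs, ys)))  # approximately (3.0, 2.0)
```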
Further difficulties arise when there are many variables. One that is important in some
contexts is the curse of dimensionality; the exponential rate of growth of the number of
unit cells in a space as the number of variables increases. Consider, for example, a
single binary variable. To obtain reasonably accurate estimates of parameters within
both of its cells we might wish to have 10 observations per cell; 20 in all. With two binary
variables (and four cells) this becomes 40 observations. With 10 binary variables it
becomes 10240 observations, and with 20 variables it becomes 10485760. The curse of
dimensionality manifests itself in the difficulty of finding accurate estimates of probability
densities in high dimensional spaces without astronomically large databases (so large, in
fact, that the gigabytes available in data mining applications pale into insignificance). In
high dimensional spaces, "nearest" points may be a long way away. These are not
simply difficulties of manipulating the many variables involved, but more fundamental
problems of what can actually be done. In such situations it becomes necessary to
impose additional restrictions through one's prior choice of model (for example, by
assuming linear models).
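A short calculation of the exponential growth in cell counts described above (10 observations per cell of p binary variables):

```python
# Observations needed at 10 per cell with p binary variables: 10 * 2**p.
for p in (1, 2, 10, 20, 30):
    print(p, 10 * 2**p)   # 20, 40, 10240, 10485760, 10737418240
```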
Various problems arise from the difficulties of accessing very large data sets. The
statistician's conventional viewpoint of a "flat" data file, in which rows represent objects
and columns represent variables, may bear no resemblance to the way the data are
stored (as in the text and Web transaction data sets described earlier). In many cases
the data are distributed, and stored on many machines. Obtaining a random sample from
data that are split up in this way is not a trivial matter. How to define the sampling frame
and how long it takes to access data become important issues.
Worse still, often the data set is constantly evolving—as with, for example, records of
telephone calls or electricity usage. Distributed or evolving data can multiply the size of a
data set many-fold as well as changing the nature of the problems requiring solution.
While the size of a data set may lead to difficulties, so also may other properties not
often found in standard statistical applications. We have already remarked that data
mining is typically a secondary process of data analysis; that is, the data were originally
collected for some other purpose. In contrast, much statistical work is concerned with
primary analysis: the data are collected with particular questions in mind, and then are
analyzed to answer those questions. Indeed, statistics includes subdisciplines of
experimental design and survey design—entire domains of expertise concerned with the
best ways to collect data in order to answer specific questions. When data are used to
address problems beyond those for which they were originally collected, they may not be
ideally suited to these problems. Sometimes the data sets are entire populations (e.g., of
chemicals in a particular class of chemicals) and therefore the standard statistical notion
of inference has no relevance. Even when they are not entire populations, they are often
convenience or opportunity samples, rather than random samples. (For instance, the
records in question may have been collected because they were the most easily
measured, or covered a particular period of time.)
In addition to problems arising from the way the data have been collected, we expect
other distortions to occur in large data sets—including missing values, contamination,
and corrupted data points. It is a rare data set that does not have such problems. Indeed,
some elaborate modeling methods include, as part of the model, a component describing
the mechanism by which missing data or other distortions arise. Alternatively, an
estimation method such as the EM algorithm (described in chapter 8) or an imputation
method that aims to generate artificial data with the same general distributional
properties as the missing data might be used. Of course, all of these problems also arise
in standard statistical applications (though perhaps to a lesser degree with small,
deliberately collected data sets) but basic statistical texts tend to gloss over them.
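As one crude illustration of the imputation idea (much simpler than the EM-based approaches the text refers to; the array and missingness rate below are invented), missing entries can be filled in with draws from the observed values of the same variable, roughly preserving its distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
data[rng.random(data.shape) < 0.1] = np.nan   # knock out roughly 10% of entries

# Single random imputation: fill each missing value with a draw from the
# observed values of the same column.
imputed = data.copy()
for j in range(data.shape[1]):
    observed = data[~np.isnan(data[:, j]), j]
    missing = np.isnan(data[:, j])
    imputed[missing, j] = rng.choice(observed, size=missing.sum())
```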
In summary, while data mining does overlap considerably with the standard exploratory
data analysis techniques of statistics, it also runs into new problems, many of which are
consequences of size and the nontraditional nature of the data sets involved.
1.7 Data Mining: Dredging, Snooping, and Fishing
An introductory chapter on data mining would not be complete without reference to the
historical use of terms such as "data mining," "dredging," "snooping," and "fishing." In the
1960s, as computers were increasingly applied to data analysis problems, it was noted
that if you searched long enough, you could always find some model to fit a data set
arbitrarily well. There are two factors contributing to this situation: the complexity of the
model and the size of the set of possible models.
Clearly, if the class of models we adopt is very flexible (relative to the size of the
available data set), then we will probably be able to fit the available data arbitrarily well.
However, as we remarked above, the aim may be to generalize beyond the available
data; a model that fits well may not be ideal for this purpose. Moreover, even if the aim is
to fit the data (for example, when we wish to produce the most accurate summary of data
describing a complete population) it is generally preferable to do this with a simple
model. To take an extreme, a model of complexity equivalent to that of the raw data
would certainly fit it perfectly, but would hardly be of interest or value.
Even with a relatively simple model structure, if we consider enough different models
with this basic structure, we can eventually expect to find a good fit. For example,
consider predicting a response variable, Y, from a predictor variable X which is chosen
from a very large set of possible variables, X_1, ..., X_p, none of which are related to Y. By
virtue of random variation in the data generating process, although there are no
underlying relationships between Y and any of the X variables, there will appear to be
relationships in the data at hand. The search process will then find the X variable that
appears to have the strongest relationship to Y. By this means, as a consequence of the
large search space, an apparent pattern is found where none really exists. The situation
is particularly bad when working with a small sample size n and a large number p of
potential X variables. Familiar examples of this sort of problem include the spurious
correlations which are popularized in the media, such as the "discovery" that over the
past 30 years when the winner of the Super Bowl championship in American football is
from a particular league, a leading stock market index historically goes up in the
following months. Similar examples are plentiful in areas such as economics and the
social sciences, fields in which data are often relatively sparse but models and theories
to fit to the data are relatively plentiful. For instance, in economic time-series prediction,
there may be a relatively short time-span of historical data available in conjunction with a
large number of economic indicators (potential predictor variables). One particularly
humorous example of this type of prediction was provided by Leinweber (personal
communication) who achieved almost perfect prediction of annual values of the well-
known Standard and Poor 500 financial index as a function of annual values from
previous years for butter production, cheese production, and sheep populations in
Bangladesh and the United States.
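A small simulation (with invented sample sizes, purely illustrative) shows how searching over many unrelated candidate predictors turns up an apparently strong relationship:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 1000                 # small sample, many candidate predictors

y = rng.normal(size=n)          # response, unrelated to every X by construction
X = rng.normal(size=(n, p))     # candidate predictors X_1, ..., X_p

# Correlation of y with each candidate; keep the one that "fits" best.
corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
best = int(np.argmax(np.abs(corrs)))
print(f"strongest apparent correlation: X_{best + 1}, r = {corrs[best]:.2f}")
# Typically around |r| = 0.6 here, despite there being no real relationship.
```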
The danger of this sort of "discovery" is well known to statisticians, who have in the past
labelled such extensive searches "data mining" or "data dredging"—causing these terms
to acquire derogatory connotations. The problem is less serious when the data sets are
large, though dangers remain even then, if the space of potential structures examined is
large enough. These risks are more pronounced in pattern detection than model fitting,
since patterns, by definition, involve relatively few cases (i.e., small sample sizes): if we
examine a billion data points, in search of an unusual configuration of just 50 points, we
have a good chance of detecting this configuration.
There are no easy technical solutions to this problem, though various strategies have
been developed, including methods that split the data into subsamples so that models
can be built and patterns can be detected using one part, and then their validity can be
tested on another part. We say more about such methods in later chapters. The final
answer, however, is to regard data mining not as a simple technical exercise, divorced
from the meaning of the data. Any potential model or pattern should be presented to the
data owner, who can then assess its interest, value, usefulness, and, perhaps above all,
its potential reality in terms of what else is known about the data.
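A minimal sketch of the data-splitting strategy mentioned above (the data and the least squares model are placeholders for whatever model or pattern is being assessed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
y = rng.normal(size=5000)

# Randomly split the records: one part to build the model or find the pattern,
# the other held out to test whether it holds up on data it has not seen.
perm = rng.permutation(len(y))
build, test = perm[:4000], perm[4000:]

# Fit on the build part only ...
beta, *_ = np.linalg.lstsq(X[build], y[build], rcond=None)

# ... then score on the held-out part; a structure that merely reflects the
# idiosyncrasies of the build data will do noticeably worse here.
print(np.sum((y[test] - X[test] @ beta) ** 2))
```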
1.8 Summary
Thanks to advances in computers and data capture technology, huge data sets—
containing gigabytes or even terabytes of data—have been and are being collected.
These mountains of data contain potentially valuable information. The trick is to extract
that valuable information from the surrounding mass of uninteresting numbers, so that
the data owners can capitalize on it. Data mining is a new discipline that seeks to do just
that: by sifting through these databases, summarizing them, and finding patterns.
Data mining should not be seen as a simple one-time exercise. Huge data collections
may be analyzed and examined in an unlimited number of ways. As time progresses, so
new kinds of structures and patterns may attract interest, and may be worth seeking in
the data.
Data mining has, for good reason, recently attracted a lot of attention: it is a new
technology, tackling new problems, with great potential for valuable commercial and
scientific discoveries. However, we should not expect it to provide answers to all
questions. Like all discovery processes, successful data mining has an element of
serendipity. While data mining provides useful tools, that does not mean that it will
inevitably lead to important, interesting, or valuable results. We must beware of over-
exaggerating the likely outcomes. But the potential is there.
1.9 Further Reading
Brief, general introductions to data mining are given in Fayyad, Piatetsky-Shapiro, and
Smyth (1996), Glymour et al. (1997), and a special issue of the Communications of the
ACM, Vol. 39, No. 11. Overviews of certain aspects of predictive data mining are given
by Adriaans and Zantinge (1996) and Weiss and Indurkhya (1998). Witten and Frank
(2000) provide a very readable, applications-oriented account of data mining from a
machine learning (artificial intelligence) perspective, and Han and Kamber (2000) is an
accessible textbook on data mining written from a database perspective. There are many
texts on data mining aimed at business users, notably Berry and Linoff (1997, 2000) that
contain extensive practical advice on potential business applications of data mining.
Leamer (1978) provides a general discussion of the dangers of data dredging, and Lovell
(1983) provides a general review of the topic from a statistical perspective. Hendry
(1995, section 15.1) provides an econometrician's view of data mining. Hand et al.
(2000) and Smyth (2000) present comparative discussions of data mining and statistics.