Data Mining: Key Preprocessing Steps

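Updated 2024-07-30. PDF, 4.01 MB.

Data Preparation for Data Mining, written by Dorian Pyle and published by Morgan Kaufmann Publishers, Inc., is a book about preparing data effectively for data mining. It covers why data preprocessing matters and the techniques involved. Preprocessing is a critical step in data mining because it directly affects the quality and accuracy of the analysis that follows. The main topics are:

1. Data cleaning: the first preprocessing step. It removes erroneous, incomplete, inconsistent, or redundant data from the data set. This includes handling missing values (by imputation or deletion), removing noise (for example, through outlier detection), and correcting data entry errors.
2. Data integration: when data comes from several sources, it has to be merged into one unified view. This can involve resolving inconsistencies and conflicts between sources and handling duplicate records.
3. Data transformation: data usually has to be converted into a form suitable for mining. This includes standardization or normalization (for example, z-score standardization or min-max scaling) and encoding categorical data as numbers.
4. Data reduction: for large data sets, dimensionality reduction (such as principal component analysis, PCA) or clustering can reduce the complexity of the data. This lowers computational cost while retaining enough information.
5. Data discretization: converting continuous data into discrete data, for example by interval partitioning or frequency-based splitting, can simplify analysis and improve the performance of some mining algorithms.
6. Data sampling: when the data volume is too large, a representative subset can be drawn for analysis, reducing the computational load while preserving the characteristics of the full data set.
7. Feature selection: by assessing how much each feature influences the target variable, the most relevant features are chosen, which reduces model complexity and can improve predictive accuracy.
8. Building mining-friendly structures: depending on the mining algorithm used, the data may need to be arranged in a specific structure, such as the input formats required by decision trees, association rules, or neural networks.

These preprocessing steps are the foundation of a data mining project and ensure the quality and suitability of the data fed into a model. Data that has not been properly preprocessed can degrade model performance or even lead to misleading conclusions. Data preprocessing is therefore an indispensable part of any data science project.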

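As a rough illustration of steps 1, 3, and 5 in the list above, the sketch below imputes a missing value with the column mean, rescales the column with min-max scaling and z-score standardization, and discretizes it into equal-width bins. It is a minimal sketch using only the Python standard library; the function names and the sample `ages` column are hypothetical and are not taken from the book.

```python
from statistics import mean, pstdev

def impute_mean(values):
    """Data cleaning: replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def z_score(values):
    """Data transformation: rescale to zero mean and unit (population) standard deviation.
    Assumes the column is not constant, otherwise the divisor is zero."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Data transformation: rescale linearly into the range [0, 1].
    Assumes the column is not constant."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, n_bins):
    """Data discretization: map each value to a bin index 0..n_bins-1 of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [20, None, 40, 60]            # hypothetical column with one missing value
cleaned = impute_mean(ages)          # [20, 40, 40, 60]
scaled = min_max(cleaned)            # [0.0, 0.5, 0.5, 1.0]
standardized = z_score(cleaned)
bins = equal_width_bins(cleaned, 2)  # [0, 1, 1, 1]
```

Mean imputation and equal-width binning are only the simplest choices here; which cleaning and discretization methods are appropriate depends on the data and the mining algorithm.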
Surveying the Data

Modeling the Data

This is the “map of the territory” that you should keep in mind as we visit each area and discuss issues. Figure 1.1 illustrates this map and shows how long each stage typically takes. It also shows the relative importance of each stage to the success of the project. Eighty percent of the importance to success comes from finding a suitable problem to address, defining what success looks like in the form of a solution, and, most critical of all, implementing the solution. If the final results are not implemented, it is impossible for any project to be successful. On the other hand, mining (preparation, surveying, and modeling) traditionally takes most of the time in any project. However, after the importance of actually implementing the result, the two most important contributors to success are solving an appropriate problem and preparing the data. While implementing the result is of the first importance to success, it is almost invariably outside the scope of the data exploration project itself. As such, implementation usually requires organizational or procedural changes inside an organization, which is well outside the scope of this discussion. Nonetheless, implementation is critical, since without implementing the results there can be no success.

Figure 1.1 Stages of a data exploration project showing importance and duration of each stage.

1.1.1 Stage 1: Exploring the Problem Space

This is a critical place to start. It is also the place that, without question, is the source of most of the misunderstandings and unrealistic expectations from data mining. Quite aside from the fact that the terms “data exploration” and “data mining” are (incorrectly) used interchangeably, data mining has been described as “a worm that crawls through your data and finds golden nuggets.” It has also been described as “a method of automatically extracting unexpected hidden patterns from data.” It is hard to see any analogous connection between either data exploration or data mining and metaphorical worms. As for automatically extracting hidden and unexpected patterns, there is some analogous truth to that statement. The real problem is that it gives no flavor for what goes into finding those hidden patterns, why you would look for them, nor any idea of how to practically use them when they are found. As a statement, it makes data mining appear to exist in a world where such things happen by themselves. This leads to “the expectation of magic” from data mining: wave a magic wand over the data and produce answers to questions you didn’t even know you had!

Without question, effective data exploration provides a disciplined approach to identifying business problems and gaining an understanding of data to help solve them. Absolutely no magic used, guaranteed.

Identifying Problems

The data exploration process starts by identifying the right problems to solve. This is not as easy as it seems. In one instance, a major telecommunications company insisted that they had already identified their problem. They were quite certain that the problem was churn. They listened patiently to the explanation of the data exploration methodology, and then, deciding it was irrelevant in this case (since they were sure they already understood the problem), requested a model to predict churn. The requested churn model was duly built, and most effective it was too. The company’s previous methods yielded about a 50% accurate prediction model. The new model raised the accuracy of the churn predictions to more than 80%. Based on this result, they developed a major marketing campaign to reduce churn in their customer base. The company spent vast amounts of money targeting at-risk customers with very little impact on churn and a disastrous impact on profitability. (Predicting churn and stopping it are different things entirely. For instance, the amazing discovery was made that unemployed people over 80 years old had a most regrettable tendency to churn. They died, and no incentive program has much impact on death!)

Fortunately they were persuaded by the apparent success, at least of the predictive model, to continue with the project. After going through the full data exploration process, they ultimately determined that the problem that should have been addressed was improving return from underperforming market segments. When appropriate models were built, the company was able to create highly successful programs to improve the value that their customer base yielded to them, instead of fighting the apparent dragon of churn. The value of finding and solving the appropriate problem was worth literally millions of dollars, and the difference between profit and loss, to this company.

Precise Problem Definition

So how is an appropriate problem discovered? There is a methodology for doing just this. Start by defining problems in a precise way. Consider, for a moment, how people generally identify problems. Usually they meet, individually or in groups, and discuss what they feel to be precise descriptions of problems; on close examination, however, they are really general statements. These general statements need to be analyzed into smaller components that can, in principle at least, be answered by examining data. In one such discussion with a manufacturer who was concerned with productivity on the assembly line, the problem was expressed as, “I really need a model of the Monday and Friday failure rates so we can put a stop to them!” The owner of this problem genuinely thought this was a precise problem description.

Eventually, this general statement was broken down into quite a large number of applicable problems and, in this particular case, led to some fairly sophisticated models reflecting which employees best fit which assembly line profiles, and for which shifts, and so on. While exploring the problem, it was necessary to define additional issues, such as what constituted a failure; how failure was detected or measured; why the Monday and Friday failure rates were significant; why these failure rates were seen as a problem; was this in fact a quality problem or a problem with fluctuation of error rates; what problem components needed to be looked at (equipment, personnel, environmental); and much more. By the end of the problem space exploration, many more components and dimensions of the problem were explored and revealed than the company had originally perceived.
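One of the sub-questions above, whether the Monday and Friday failure rates really stand out, is the kind of component that can be answered by examining data. A minimal sketch of that first tabulation follows. The record layout, the numbers, and the function name are hypothetical, invented for illustration; they are not data from the manufacturer described here, and the sketch assumes a failure has already been defined and counted per production day.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical inspection records: (production date, units built, units failed).
records = [
    (date(2024, 1, 1), 100, 9),   # a Monday
    (date(2024, 1, 2), 100, 3),   # a Tuesday
    (date(2024, 1, 5), 100, 8),   # a Friday
    (date(2024, 1, 8), 100, 11),  # a Monday
]

def failure_rate_by_weekday(rows):
    """Return {weekday name: failed units / built units}, pooled over all rows."""
    built = defaultdict(int)
    failed = defaultdict(int)
    for day, n_built, n_failed in rows:
        name = WEEKDAYS[day.weekday()]  # date.weekday() returns 0 for Monday
        built[name] += n_built
        failed[name] += n_failed
    return {name: failed[name] / built[name] for name in built}

rates = failure_rate_by_weekday(records)
for name in WEEKDAYS:
    if name in rates:
        print(f"{name}: {rates[name]:.1%}")
```

A table like this only describes the rates. Whether the differences are significant, and why they arise, are the further questions the problem space exploration has to raise.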

It has been said that a clear statement of a problem is half the battle. It is, and it points directly to the solution needed. That is what exploring the problem space in a rigorous manner achieves. Usually (and this was the case with the manufacturer), the exploration itself yields insights without the application of any automated techniques.

Cognitive Maps

Sometimes the problem space is hard to understand. If it seems difficult to gain insight into the structure of the problem, or there seem to be many conflicting details, it may be helpful to structure the problem in some convenient way. One method of structuring a problem space is by using a tool known as a cognitive map (Figures 1.2(a) and 1.2(b)). A useful tool for exploring complex problem spaces, a cognitive map is a physical picture of what are perceived as the objects that make up the problem space, together with the interconnections and interactions of the variables of the objects. It will very often show where there are conflicting views of the structure of the problem.
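A cognitive map is drawn on paper, but its content (objects and the links between them) can also be recorded as a small graph, which makes conflicting views easy to list. The sketch below is one possible encoding, not the notation used in Figures 1.2(a) and 1.2(b): each link is a hypothetical (source, target) pair marked "+" or "-" for the perceived direction of influence, and the two stakeholder views are invented examples.

```python
# Each map records perceived influences as {(source object, target object): sign}.
# "+" means the source is believed to increase the target, "-" to decrease it.
sales_view = {
    ("price", "churn"): "+",
    ("service quality", "churn"): "-",
}
operations_view = {
    ("price", "churn"): "-",
    ("network outages", "churn"): "+",
}

def conflicting_links(map_a, map_b):
    """Links that appear in both maps but with opposite signs."""
    shared = map_a.keys() & map_b.keys()
    return sorted(link for link in shared if map_a[link] != map_b[link])

def unshared_links(map_a, map_b):
    """Links that only one of the two maps contains."""
    return sorted(map_a.keys() ^ map_b.keys())

conflicts = conflicting_links(sales_view, operations_view)
missing = unshared_links(sales_view, operations_view)
print("Disagree on:", conflicts)
print("Seen by only one side:", missing)
```

Both lists are prompts for discussion rather than findings: a disagreement about a link, or a link only one group perceives, marks a place where the problem structure is not yet agreed.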
