Data Mining: Data Preprocessing in Detail

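"Data Preparation for Data Mining", by Dorian Pyle, covers the data preprocessing stage of data mining. Diane D. Cerra served as senior editor. The book examines key steps such as data cleaning, transformation, and integration, with the aim of improving the efficiency and accuracy of data mining.

In data mining, data preparation is a crucial stage: it directly affects the quality of the subsequent analysis and the validity of the results. "Data Preparation for Data Mining" explains this process in depth, including the following core topics:

1. Data cleaning: data usually contains missing values, outliers, and noise. Cleaning aims to identify and handle these problems so that the data is accurate and complete. This may involve filling in missing values, removing outliers, or smoothing the data.
2. Data transformation: transformation makes the data suitable for a particular data mining algorithm. This may include standardization (bringing variables onto a common scale), normalization (rescaling values into the 0-1 range), and encoding (for example, converting categorical variables to numeric ones).
3. Data integration: in real projects, data often comes from several different sources. Integration merges this heterogeneous data and has to resolve inconsistencies, duplicates, and format differences.
4. Feature selection: with a large number of features, it is essential to select those with the most influence on the target variable. Feature selection can reduce computational cost and improve both the interpretability and the predictive performance of a model.
5. Data sampling: sampling is used to create training and test sets so that model performance can be evaluated. Strategies include random sampling, stratified sampling, and oversampling or undersampling.
6. Dimensionality reduction: very high-dimensional data can lead to the "curse of dimensionality". Techniques such as principal component analysis (PCA), singular value decomposition (SVD), and clustering methods can help reduce the complexity of the data.
7. The preprocessing workflow: the whole process needs to be systematic and deliberate, covering data understanding, data cleaning, data transformation, data integration, and data validation.
8. Practical tools and software: in practice, these steps are commonly carried out with tools such as R, Python, and SQL, and with open-source libraries such as Pandas, NumPy, and Scikit-learn.

By understanding and practicing these preprocessing techniques, data scientists can raise the quality of their data and so build more accurate and more reliable models.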

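As a rough illustration of the cleaning and transformation steps summarized above, here is a minimal sketch in plain Python. It is not taken from the book: the function names, the choice of mean imputation, and the toy data are all assumptions made for this example, and a real project would more likely rely on a library such as Pandas or Scikit-learn.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace None entries with the mean of the observed values (one simple cleaning choice)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: there is no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(labels):
    """Encode categorical labels as 0/1 indicator rows, one column per category."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows


# Hypothetical toy data: one numeric column with a missing value, one categorical column.
ages = [20, None, 40, 60]
filled = impute_mean(ages)
scaled = min_max_normalize(filled)
zscores = standardize(filled)
categories, encoded = one_hot(["red", "blue", "red"])
```

Mean imputation is only one of several possible ways to fill a gap, and whether a value counts as missing or as an outlier is a judgment that has to be made before any of these functions is applied.
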
Surveying the Data

Modeling the Data

This is the “map of the territory” that you should keep in mind as we visit each area and discuss issues. Figure 1.1 illustrates this map and shows how long each stage typically takes. It also shows the relative importance of each stage to the success of the project. Eighty percent of the importance to success comes from finding a suitable problem to address, defining what success looks like in the form of a solution, and, most critical of all, implementing the solution. If the final results are not implemented, it is impossible for any project to be successful. On the other hand, mining (preparation, surveying, and modeling) traditionally takes most of the time in any project. However, after the importance of actually implementing the result, the two most important contributors to success are solving an appropriate problem and preparing the data. While implementing the result is of the first importance to success, it is almost invariably outside the scope of the data exploration project itself. As such, implementation usually requires organizational or procedural changes inside an organization, which is well outside the scope of this discussion. Nonetheless, implementation is critical, since without implementing the results there can be no success.

Figure 1.1 Stages of a data exploration project showing importance and duration of each stage.

1.1.1 Stage 1: Exploring the Problem Space

This is a critical place to start. It is also the place that, without question, is the source of most of the misunderstandings and unrealistic expectations from data mining. Quite aside from the fact that the terms “data exploration” and “data mining” are (incorrectly) used interchangeably, data mining has been described as “a worm that crawls through your data and finds golden nuggets.” It has also been described as “a method of automatically extracting unexpected hidden patterns from data.” It is hard to see any analogous connection between either data exploration or data mining and metaphorical worms. As for automatically extracting hidden and unexpected patterns, there is some analogous truth to that statement. The real problem is that it gives no flavor for what goes into finding those hidden patterns, why you would look for them, nor any idea of how to practically use them when they are found. As a statement, it makes data mining appear to exist in a world where such things happen by themselves. This leads to “the expectation of magic” from data mining: wave a magic wand over the data and produce answers to questions you didn’t even know you had!

Without question, effective data exploration provides a disciplined approach to identifying business problems and gaining an understanding of data to help solve them. Absolutely no magic used, guaranteed.

Identifying Problems

The data exploration process starts by identifying the right problems to solve. This is not as easy as it seems. In one instance, a major telecommunications company insisted that they had already identified their problem. They were quite certain that the problem was churn. They listened patiently to the explanation of the data exploration methodology, and then, deciding it was irrelevant in this case (since they were sure they already understood the problem), requested a model to predict churn. The requested churn model was duly built, and most effective it was too. The company’s previous methods yielded about a 50% accurate prediction model. The new model raised the accuracy of the churn predictions to more than 80%. Based on this result, they developed a major marketing campaign to reduce churn in their customer base. The company spent vast amounts of money targeting at-risk customers with very little impact on churn and a disastrous impact on profitability. (Predicting churn and stopping it are different things entirely. For instance, the amazing discovery was made that unemployed people over 80 years old had a most regrettable tendency to churn. They died, and no incentive program has much impact on death!)
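The gap between predicting churn and stopping it can be made concrete with a little arithmetic. The sketch below is illustrative only: the function, its parameters, and every number in it are hypothetical assumptions, not figures from the telecommunications case. It also treats the quoted accuracy loosely as the share of targeted customers who really would have churned.

```python
def campaign_net_value(targeted, precision, persuadable_rate,
                       value_per_saved_customer, cost_per_contact):
    """Net value of a retention campaign aimed at customers flagged as likely churners.

    precision        -- share of targeted customers who really would have churned
    persuadable_rate -- share of those true churners an incentive can actually retain
    """
    true_churners = targeted * precision
    saved = true_churners * persuadable_rate
    return saved * value_per_saved_customer - targeted * cost_per_contact


# Hypothetical scenario: a good predictor, but almost nobody can be persuaded to stay.
loss = campaign_net_value(targeted=10_000, precision=0.8, persuadable_rate=0.02,
                          value_per_saved_customer=200.0, cost_per_contact=50.0)

# Same predictor and costs, but aimed at a group that does respond to incentives.
gain = campaign_net_value(targeted=10_000, precision=0.8, persuadable_rate=0.5,
                          value_per_saved_customer=200.0, cost_per_contact=50.0)
```

With these made-up inputs the first campaign loses money and the second earns it, although the predictive model is identical in both. The outcome is driven by whether the flagged customers can be influenced at all, which the model's accuracy says nothing about.
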

Fortunately they were persuaded by the apparent success, at least of the predictive model, to continue with the project. After going through the full data exploration process, they ultimately determined that the problem that should have been addressed was improving return from underperforming market segments. When appropriate models were built, the company was able to create highly successful programs to improve the value that their customer base yielded to them, instead of fighting the apparent dragon of churn. The value of finding and solving the appropriate problem was worth literally millions of dollars, and the difference between profit and loss, to this company.

Precise Problem Definition

So how is an appropriate problem discovered? There is a methodology for doing just this. Start by defining problems in a precise way. Consider, for a moment, how people generally identify problems. Usually they meet, individually or in groups, and discuss what they feel to be precise descriptions of problems; on close examination, however, they are really general statements. These general statements need to be analyzed into smaller components that can, in principle at least, be answered by examining data. In one such discussion with a manufacturer who was concerned with productivity on the assembly line, the problem was expressed as, “I really need a model of the Monday and Friday failure rates so we can put a stop to them!” The owner of this problem genuinely thought this was a precise problem description.

Eventually, this general statement was broken down into quite a large number of applicable problems and, in this particular case, led to some fairly sophisticated models reflecting which employees best fit which assembly line profiles, and for which shifts, and so on. While exploring the problem, it was necessary to define additional issues, such as what constituted a failure; how failure was detected or measured; why the Monday and Friday failure rates were significant; why these failure rates were seen as a problem; was this in fact a quality problem or a problem with fluctuation of error rates; what problem components needed to be looked at (equipment, personnel, environmental); and much more. By the end of the problem space exploration, many more components and dimensions of the problem were explored and revealed than the company had originally perceived.
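To see how one such component becomes answerable from data, consider a minimal sketch that computes a failure rate per weekday. Everything in it is assumed for illustration: the record layout (one production date and one pass/fail flag per inspected unit), the definition of the rate as failed units divided by inspected units, and the sample records. None of it comes from the manufacturer's actual data.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def failure_rate_by_weekday(records):
    """Return {weekday name: failed units / inspected units}.

    records -- iterable of (production_date, failed) pairs, one per inspected unit.
    What counts as "failed" must already have been decided when the flag was recorded.
    """
    inspected, failed = Counter(), Counter()
    for day, is_failure in records:
        name = WEEKDAYS[day.weekday()]  # date.weekday(): Monday is 0
        inspected[name] += 1
        if is_failure:
            failed[name] += 1
    return {name: failed[name] / inspected[name] for name in inspected}


# Hypothetical inspection records for one week in May 2023.
records = [
    (date(2023, 5, 22), True), (date(2023, 5, 22), False),    # Monday
    (date(2023, 5, 24), False), (date(2023, 5, 24), False),   # Wednesday
    (date(2023, 5, 26), True), (date(2023, 5, 26), True),     # Friday
    (date(2023, 5, 26), False), (date(2023, 5, 26), False),
]
rates = failure_rate_by_weekday(records)
```

Even this small calculation cannot be written without first settling what a failure is and how it is measured, which is the kind of definition the exploration had to produce.
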

It has been said that a clear statement of a problem is half the battle. It is, and it points directly to the solution needed. That is what exploring the problem space in a rigorous manner achieves. Usually (and this was the case with the manufacturer), the exploration itself yields insights without the application of any automated techniques.

Cognitive Maps

Sometimes the problem space is hard to understand. If it seems difficult to gain insight into the structure of the problem, or there seem to be many conflicting details, it may be helpful to structure the problem in some convenient way. One method of structuring a problem space is by using a tool known as a cognitive map (Figures 1.2(a) and 1.2(b)). A useful tool for exploring complex problem spaces, a cognitive map is a physical picture of what are perceived as the objects that make up the problem space, together with the interconnections and interactions of the variables of the objects. It will very often show where there are conflicting views of the structure of the problem.
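One way to hold such a map in code is as a directed graph whose edges record how one object is believed to influence another. The sketch below is an assumption made for illustration, not the notation of Figures 1.2(a) and 1.2(b): the class name, the use of +1 and -1 for "increases" and "decreases", and the sample objects and stakeholders are all hypothetical.

```python
from collections import defaultdict


class CognitiveMap:
    """Objects in a problem space plus signed influences between them, recorded per stakeholder."""

    def __init__(self):
        # (source, target) -> {stakeholder: sign}, where sign is +1 (increases) or -1 (decreases)
        self._views = defaultdict(dict)

    def add_view(self, stakeholder, source, target, sign):
        """Record that a stakeholder believes `source` pushes `target` up (+1) or down (-1)."""
        if sign not in (1, -1):
            raise ValueError("sign must be +1 or -1")
        self._views[(source, target)][stakeholder] = sign

    def objects(self):
        """Return every object that appears at either end of an influence."""
        return sorted({name for edge in self._views for name in edge})

    def conflicts(self):
        """Return the influences on which stakeholders disagree about the direction."""
        return sorted(edge for edge, views in self._views.items()
                      if len(set(views.values())) > 1)


# Hypothetical views from two departments about the same problem space.
cmap = CognitiveMap()
cmap.add_view("sales", "discount level", "order volume", +1)
cmap.add_view("sales", "order volume", "profit", +1)
cmap.add_view("finance", "discount level", "profit", -1)
cmap.add_view("finance", "order volume", "profit", -1)
disputed = cmap.conflicts()
```

Here the two departments disagree only about the effect of order volume on profit, so that is the one influence `conflicts()` reports. Surfacing that kind of disagreement is what the map is for.
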
