Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery

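"Data Mining with Rattle and R.pdf" is a book on data mining with Rattle and R that guides readers through the basic concepts and popular algorithms of data mining by way of hands-on practice. It covers data understanding, data preprocessing, data cleaning, model building, model evaluation, and deployment, with particular emphasis on Rattle, an easy-to-use, free, open source data mining tool built on the R statistical software. The author, Graham Williams, shows how Rattle and R combine into a powerful data mining environment that holds its own against commercial software. Rattle (the R Analytical Tool To Learn Easily) is a graphical user interface for R that simplifies the data mining process, so that both beginners and professionals can get a project under way quickly.

The key topics discussed in the book include:

1. **Data understanding**: An initial exploration of the raw data, covering the structure of the dataset, missing values, outliers, and potential patterns of association. Rattle provides visualisation tools that help in understanding distributions and relationships in the data.
2. **Data preprocessing**: An essential step in data mining that includes data cleaning (handling missing values and outliers), data transformation (standardisation, normalisation), and feature selection. R and Rattle provide a range of functions for these operations.
3. **Data cleaning**: Handling incomplete or inaccurate data, for example by imputing missing values (with the mean, median, or mode) and by detecting and treating outliers, so that models are trained on high-quality data.
4. **Model building**: Choosing a suitable algorithm for the task: predictive models such as decision trees, random forests, and support vector machines, or descriptive models such as cluster analysis. R offers a rich set of machine learning packages (for example `randomForest`, `caret`, and `e1071`).
5. **Model evaluation**: Assessing model performance with cross-validation, ROC curves, precision, recall, and similar measures. Rattle helps compare the results of different models so that the best one can be selected.
6. **Deployment**: Applying the finished model to a real problem, such as prediction, classification, or pattern recognition, and presenting the results as a report to support business decisions.
7. **Combining R and Rattle**: As a front end to R, Rattle streamlines data import, exploration, modelling, and reporting and makes the data mining process more intuitive, while R's computational power and extensive collection of statistical packages provide the foundation for model building.

The book suits newcomers to data science as well as readers with some background in statistics. Through examples and hands-on exercises it teaches the core techniques and methods of data mining, and it can also serve as a textbook or reference in teaching. Because Rattle and R are free, readers can obtain and use a powerful data mining toolkit at no cost while building their skills in data analysis and knowledge discovery.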
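The data cleaning and model evaluation steps described above can be sketched in a few lines of base R. This is only an illustrative sketch and is not taken from the book: the data values, the prediction vector, and the helper names `impute_median` and `rescale01` are all assumptions made up for the example. It shows median imputation of missing values, min-max rescaling, and a confusion matrix from which precision, recall, and accuracy are derived.

```r
# Toy data with missing values; every number here is made up for illustration.
df <- data.frame(
  age    = c(23, NA, 35, 41, NA, 52),
  income = c(30, 45, NA, 80, 60, 75)
)

# Hypothetical helper: replace each NA with the median of the observed values.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
df_imputed <- as.data.frame(lapply(df, impute_median))

# Hypothetical helper: min-max rescaling of a numeric vector to [0, 1].
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
df_scaled <- as.data.frame(lapply(df_imputed, rescale01))

# Made-up actual and predicted class labels standing in for a model's output.
lv        <- c("no", "yes")
actual    <- factor(c("yes", "no", "yes", "yes", "no", "no", "yes", "no"), levels = lv)
predicted <- factor(c("yes", "no", "no", "yes", "no", "yes", "yes", "no"), levels = lv)

# Confusion matrix: rows are the actual classes, columns the predicted classes.
cm <- table(Actual = actual, Predicted = predicted)

tp <- cm["yes", "yes"]  # true positives
fp <- cm["no", "yes"]   # false positives
fn <- cm["yes", "no"]   # false negatives

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
accuracy  <- sum(diag(cm)) / sum(cm)

print(df_imputed)
print(cm)
print(c(precision = precision, recall = recall, accuracy = accuracy))
```

In a real project the predicted labels would come from a fitted model applied to held-out data, and the median is only one of the imputation choices the description mentions (mean and mode are the others).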

Contents

Preface

Part I: Explorations

1 Introduction

1.1 Data Mining Beginnings

1.2 The Data Mining Team

1.3 Agile Data Mining

1.4 The Data Mining Process

1.5 A Typical Journey

1.6 Insights for Data Mining

1.7 Documenting Data Mining

1.8 Tools for Data Mining: R

1.9 Tools for Data Mining: Rattle

1.10 Why R and Rattle?

1.11 Privacy

1.12 Resources

2 Getting Started

2.1 Starting R

2.2 Quitting Rattle and R

2.3 First Contact

2.4 Loading a Dataset

2.5 Building a Model

2.6 Understanding Our Data

2.7 Evaluating the Model: Confusion Matrix

2.8 Interacting with Rattle

2.9 Interacting with R

2.10 Summary

2.11 Command Summary

3 Working with Data

3.1 Data Nomenclature

3.2 Sourcing Data for Mining

3.3 Data Quality

3.4 Data Matching

3.5 Data Warehousing

3.6 Interacting with Data Using R

3.7 Documenting the Data

3.8 Summary

3.9 Command Summary

4 Loading Data

4.1 CSV Data

4.2 ARFF Data

4.3 ODBC Sourced Data

4.4 R Dataset: Other Data Sources

4.5 R Data

4.6 Library

4.7 Data Options

4.8 Command Summary

5 Exploring Data

5.1 Summarising Data

5.1.1 Basic Summaries

5.1.2 Detailed Numeric Summaries

5.1.3 Distribution

5.1.4 Skewness

5.1.5 Kurtosis

5.1.6 Missing Values

5.2 Visualising Distributions

5.2.1 Box Plot

5.2.2 Histogram

5.2.3 Cumulative Distribution Plot

5.2.4 Benford’s Law

5.2.5 Bar Plot

5.2.6 Dot Plot

5.2.7 Mosaic Plot

5.2.8 Pairs and Scatter Plots

5.2.9 Plots with Groups

5.3 Correlation Analysis

5.3.1 Correlation Plot

5.3.2 Missing Value Correlations

5.3.3 Hierarchical Correlation

5.4 Command Summary

6 Interactive Graphics

6.1 Latticist

6.2 GGobi

6.3 Command Summary

7 Transforming Data

7.1 Data Issues

7.2 Transforming Data

7.3 Rescaling Data

7.4 Imputation

7.5 Recoding

7.6 Cleanup

7.7 Command Summary

Part II: Building Models

8 Descriptive and Predictive Analytics

8.1 Model Nomenclature

8.2 A Framework for Modelling

8.3 Descriptive Analytics

8.4 Predictive Analytics

8.5 Model Builders

9 Cluster Analysis

9.1 Knowledge Representation

9.2 Search Heuristic

9.3 Measures

9.4 Tutorial Example

9.5 Discussion

9.6 Command Summary

10 Association Analysis

10.1 Knowledge Representation

10.2 Search Heuristic

10.3 Measures

10.4 Tutorial Example

10.5 Command Summary

11 Decision Trees

11.1 Knowledge Representation

11.2 Algorithm

11.3 Measures

11.4 Tutorial Example

11.5 Tuning Parameters

11.6 Discussion

11.7 Summary

11.8 Command Summary

12 Random Forests

12.1 Overview

12.2 Knowledge Representation

12.3 Algorithm

12.4 Tutorial Example

12.5 Tuning Parameters

12.6 Discussion

12.7 Summary

12.8 Command Summary

13 Boosting

13.1 Knowledge Representation

13.2 Algorithm

13.3 Tutorial Example

13.4 Tuning Parameters

13.5 Discussion

13.6 Summary

13.7 Command Summary

14 Support Vector Machines

14.1 Knowledge Representation

14.2 Algorithm
