Introduction to Data Mining
Authors: Pang-Ning Tan, Michael Steinbach, Vipin Kumar

Data mining is a technology that blends traditional data analysis methods with sophisticated algorithms for processing large volumes of data. It has also opened up exciting opportunities for exploring and analyzing new types of data, and for analyzing old types of data in new ways. In this introductory chapter, we present an overview of data mining and outline the key topics to be covered in this book. We start with a description of some well-known applications that require new techniques for data analysis.
PANG-NING TAN, Michigan State University
MICHAEL STEINBACH, University of Minnesota
VIPIN KUMAR, University of Minnesota and Army High Performance Computing Research Center

Boston · San Francisco · New York · London · Toronto · Sydney · Tokyo · Singapore · Madrid · Mexico City · Munich · Paris · Cape Town · Hong Kong · Montreal
Contents

Preface
1 Introduction
1.1 What Is Data Mining?
1.2 Motivating Challenges
1.3 The Origins of Data Mining
1.4 Data Mining Tasks
1.5 Scope and Organization of the Book
1.6 Bibliographic Notes
1.7 Exercises
2 Data
2.1 Types of Data
2.1.1 Attributes and Measurement
2.1.2 Types of Data Sets
2.2 Data Quality
2.2.1 Measurement and Data Collection Issues
2.2.2 Issues Related to Applications
2.3 Data Preprocessing
2.3.1 Aggregation
2.3.2 Sampling
2.3.3 Dimensionality Reduction
2.3.4 Feature Subset Selection
2.3.5 Feature Creation
2.3.6 Discretization and Binarization
2.3.7 Variable Transformation
2.4 Measures of Similarity and Dissimilarity
2.4.1 Basics
2.4.2 Similarity and Dissimilarity between Simple Attributes
2.4.3 Dissimilarities between Data Objects
2.4.4 Similarities between Data Objects
2.4.5 Examples of Proximity Measures
2.4.6 Issues in Proximity Calculation
2.4.7 Selecting the Right Proximity Measure
2.5 Bibliographic Notes
2.6 Exercises
3 Exploring Data
3.1 The Iris Data Set
3.2 Summary Statistics
3.2.1 Frequencies and the Mode
3.2.2 Percentiles
3.2.3 Measures of Location: Mean and Median
3.2.4 Measures of Spread: Range and Variance
3.2.5 Multivariate Summary Statistics
3.2.6 Other Ways to Summarize the Data
3.3 Visualization
3.3.1 Motivations for Visualization
3.3.2 General Concepts
3.3.3 Techniques
3.3.4 Visualizing Higher-Dimensional Data
3.3.5 Do's and Don'ts
3.4 OLAP and Multidimensional Data Analysis
3.4.1 Representing Iris Data as a Multidimensional Array
3.4.2 Multidimensional Data: The General Case
3.4.3 Analyzing Multidimensional Data
3.4.4 Final Comments on Multidimensional Data Analysis
3.5 Bibliographic Notes
3.6 Exercises
4 Classification: Basic Concepts, Decision Trees, and Model Evaluation
4.1 Preliminaries
4.2 General Approach to Solving a Classification Problem
4.3 Decision Tree Induction
4.3.1 How a Decision Tree Works
4.3.2 How to Build a Decision Tree
4.3.3 Methods for Expressing Attribute Test Conditions
4.3.4 Measures for Selecting the Best Split
4.3.5 Algorithm for Decision Tree Induction
4.3.6 An Example: Web Robot Detection
4.3.7 Characteristics of Decision Tree Induction
4.4 Model Overfitting
4.4.1 Overfitting Due to Presence of Noise
4.4.2 Overfitting Due to Lack of Representative Samples
4.4.3 Overfitting and the Multiple Comparison Procedure
4.4.4 Estimation of Generalization Errors
4.4.5 Handling Overfitting in Decision Tree Induction
4.5 Evaluating the Performance of a Classifier
4.5.1 Holdout Method
4.5.2 Random Subsampling
4.5.3 Cross-Validation
4.5.4 Bootstrap
4.6 Methods for Comparing Classifiers
4.6.1 Estimating a Confidence Interval for Accuracy
4.6.2 Comparing the Performance of Two Models
4.6.3 Comparing the Performance of Two Classifiers
4.7 Bibliographic Notes
4.8 Exercises
5 Classification: Alternative Techniques
5.1 Rule-Based Classifier
5.1.1 How a Rule-Based Classifier Works
5.1.2 Rule-Ordering Schemes
5.1.3 How to Build a Rule-Based Classifier
5.1.4 Direct Methods for Rule Extraction
5.1.5 Indirect Methods for Rule Extraction
5.1.6 Characteristics of Rule-Based Classifiers
5.2 Nearest-Neighbor Classifiers
5.2.1 Algorithm
5.2.2 Characteristics of Nearest-Neighbor Classifiers
5.3 Bayesian Classifiers
5.3.1 Bayes Theorem
5.3.2 Using the Bayes Theorem for Classification
5.3.3 Naïve Bayes Classifier
5.3.4 Bayes Error Rate
5.3.5 Bayesian Belief Networks
5.4 Artificial Neural Network (ANN)
5.4.1 Perceptron
5.4.2 Multilayer Artificial Neural Network
5.4.3 Characteristics of ANN
5.5 Support Vector Machine (SVM)
5.5.1 Maximum Margin Hyperplanes
5.5.2 Linear SVM: Separable Case
5.5.3 Linear SVM: Nonseparable Case
5.5.4 Nonlinear SVM
5.5.5 Characteristics of SVM
5.6 Ensemble Methods
5.6.1 Rationale for Ensemble Method
5.6.2 Methods for Constructing an Ensemble Classifier
5.6.3 Bias-Variance Decomposition
5.6.4 Bagging
5.6.5 Boosting
5.6.6 Random Forests
5.6.7 Empirical Comparison among Ensemble Methods
5.7 Class Imbalance Problem
5.7.1 Alternative Metrics
5.7.2 The Receiver Operating Characteristic Curve
5.7.3 Cost-Sensitive Learning
5.7.4 Sampling-Based Approaches
5.8 Multiclass Problem
5.9 Bibliographic Notes
5.10 Exercises
6 Association Analysis: Basic Concepts and Algorithms
6.1 Problem Definition
6.2 Frequent Itemset Generation
6.2.1 The Apriori Principle
6.2.2 Frequent Itemset Generation in the Apriori Algorithm
6.2.3 Candidate Generation and Pruning
6.2.4 Support Counting
6.2.5 Computational Complexity
6.3 Rule Generation
6.3.1 Confidence-Based Pruning
6.3.2 Rule Generation in Apriori Algorithm
6.3.3 An Example: Congressional Voting Records
6.4 Compact Representation of Frequent Itemsets
6.4.1 Maximal Frequent Itemsets
6.4.2 Closed Frequent Itemsets
6.5 Alternative Methods for Generating Frequent Itemsets
6.6 FP-Growth Algorithm
6.6.1 FP-Tree Representation
6.6.2 Frequent Itemset Generation in FP-Growth Algorithm
6.7 Evaluation of Association Patterns
6.7.1 Objective Measures of Interestingness
6.7.2 Measures beyond Pairs of Binary Variables
6.7.3 Simpson's Paradox
6.8 Effect of Skewed Support Distribution
6.9 Bibliographic Notes
6.10 Exercises
7 Association Analysis: Advanced Concepts
7.1 Handling Categorical Attributes
7.2 Handling Continuous Attributes
7.2.1 Discretization-Based Methods
7.2.2 Statistics-Based Methods
7.2.3 Non-discretization Methods
7.3 Handling a Concept Hierarchy
7.4 Sequential Patterns
7.4.1 Problem Formulation
7.4.2 Sequential Pattern Discovery
7.4.3 Timing Constraints
7.4.4 Alternative Counting Schemes
7.5 Subgraph Patterns
7.5.1 Graphs and Subgraphs
7.5.2 Frequent Subgraph Mining
7.5.3 Apriori-like Method
7.5.4 Candidate Generation
7.5.5 Candidate Pruning
7.5.6 Support Counting
7.6 Infrequent Patterns
7.6.1 Negative Patterns
7.6.2 Negatively Correlated Patterns
7.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns
7.6.4 Techniques for Mining Interesting Infrequent Patterns
7.6.5 Techniques Based on Mining Negative Patterns
7.6.6 Techniques Based on Support Expectation
7.7 Bibliographic Notes
7.8 Exercises
8 Cluster Analysis: Basic Concepts and Algorithms
8.1 Overview
8.1.1 What Is Cluster Analysis?
8.1.2 Different Types of Clusterings
8.1.3 Different Types of Clusters
8.2 K-means
8.2.1 The Basic K-means Algorithm
8.2.2 K-means: Additional Issues
8.2.3 Bisecting K-means
8.2.4 K-means and Different Types of Clusters
8.2.5 Strengths and Weaknesses
8.2.6 K-means as an Optimization Problem
8.3 Agglomerative Hierarchical Clustering
8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm
8.3.2 Specific Techniques
8.3.3 The Lance-Williams Formula for Cluster Proximity
8.3.4 Key Issues in Hierarchical Clustering
8.3.5 Strengths and Weaknesses
8.4 DBSCAN
8.4.1 Traditional Density: Center-Based Approach
8.4.2 The DBSCAN Algorithm
8.4.3 Strengths and Weaknesses
8.5 Cluster Evaluation
8.5.1 Overview
8.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation
8.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix
8.5.4 Unsupervised Evaluation of Hierarchical Clustering
8.5.5 Determining the Correct Number of Clusters
8.5.6 Clustering Tendency
8.5.7 Supervised Measures of Cluster Validity
8.5.8 Assessing the Significance of Cluster Validity Measures
8.6 Bibliographic Notes
8.7 Exercises
9 Cluster Analysis: Additional Issues and Algorithms
9.1 Characteristics of Data, Clusters, and Clustering Algorithms
9.1.1 Example: Comparing K-means and DBSCAN
9.1.2 Data Characteristics