Data Clustering Algorithms and Applications: An In-Depth Analysis


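"Data Clustering: Algorithms and Applications" is a high-quality PDF of a book published in 2014 by Taylor & Francis Group in the Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. It focuses on the algorithms and applications of cluster analysis and is a classic reference for data mining and knowledge discovery.

In data science, cluster analysis is an unsupervised learning method that groups data into clusters according to similarity or distance. The book likely covers a range of clustering algorithms that are useful for understanding and exploring the structure of large, complex datasets. Key topics it may cover include:

1. **Basic concepts**: distance measures (Euclidean distance, Manhattan distance, cosine similarity, and so on), similarity measures, and the goals and challenges of clustering.
2. **Common clustering algorithms**:
   - **Hierarchical clustering**: agglomerative and divisive variants, with linkage criteria such as single linkage, complete linkage, and average linkage.
   - **K-means**: an iterative centroid-based method that looks for K clusters minimizing the sum of squared distances from each point to its nearest center.
   - **DBSCAN (Density-Based Spatial Clustering of Applications with Noise)**: a density-based method that can find clusters of arbitrary shape and labels noise points automatically.
   - **Spectral clustering**: builds a graph from the similarity matrix of the data and clusters it using spectral graph theory.
   - **BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)**: designed for large datasets; it builds a hierarchical summary structure to reduce memory requirements.
3. **Evaluating and choosing clustering algorithms**: cluster quality measures such as the silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index, and how to choose an algorithm to suit the characteristics of the data.
4. **Application areas**: cluster analysis is widely used in market segmentation, bioinformatics, image analysis, social network analysis, and recommender systems. The book's examples may include concrete applications and case studies from these areas.
5. **Algorithm optimization and improvement**: optimization techniques for specific problems, such as parallel clustering, distributed computation, and memory optimization strategies.
6. **Data preprocessing**: preprocessing is critical in clustering and may include missing-value handling, outlier detection, feature selection, and standardization.
7. **Visualization**: tools and techniques for visualizing clustering results, such as scatterplots, heat maps, and dendrograms, which help in understanding cluster structure.
8. **Privacy and security**: how to protect personal privacy and data security during data mining may also be discussed.

For data scientists and researchers who want a deeper understanding of clustering algorithms and their practical use, this book is a valuable resource. By studying and practicing its content, readers can learn the core techniques of cluster analysis and how to apply them to real problems.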
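As a concrete illustration of the K-means description above, here is a minimal pure-Python sketch of the usual assignment/update iteration (often called Lloyd's algorithm). It is not taken from the book: the function names `squared_euclidean` and `kmeans`, the caller-supplied initial centers, and the toy points are all assumptions made for illustration.

```python
def squared_euclidean(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, init_centers, max_iter=100):
    """Alternate assignment and centroid-update steps until centers stop moving.

    `init_centers` is supplied by the caller (an assumption of this sketch);
    real implementations usually pick it randomly or with a seeding scheme.
    """
    centers = [tuple(c) for c in init_centers]
    k = len(centers)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda j: squared_euclidean(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    # Objective value: sum of squared distances to the final centers.
    sse = sum(squared_euclidean(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, sse


# Hypothetical toy data: two well-separated groups in the plane.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centers, sse = kmeans(points, init_centers=[points[0], points[3]])
print(labels, centers, sse)
```

On this toy data the two groups are recovered. In general the result depends on initialization, so several restarts or a seeding scheme such as k-means++ are commonly used.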

Contents xv

15.5.4 Trajectory Clustering

15.6 Time-Series Clustering Applications

15.7 Conclusions

16 Clustering Biological Data

Chandan K. Reddy, Mohammad Al Hasan, and Mohammed J. Zaki

16.1 Introduction

16.2 Clustering Microarray Data

16.2.1 Proximity Measures

16.2.2 Categorization of Algorithms

16.2.3 Standard Clustering Algorithms

16.2.3.1 Hierarchical Clustering

16.2.3.2 Probabilistic Clustering

16.2.3.3 Graph-Theoretic Clustering

16.2.3.4 Self-Organizing Maps

16.2.3.5 Other Clustering Methods

16.2.4 Biclustering

16.2.4.1 Types and Structures of Biclusters

16.2.4.2 Biclustering Algorithms

16.2.4.3 Recent Developments

16.2.5 Triclustering

16.2.6 Time-Series Gene Expression Data Clustering

16.2.7 Cluster Validation

16.3 Clustering Biological Networks

16.3.1 Characteristics of PPI Network Data

16.3.2 Network Clustering Algorithms

16.3.2.1 Molecular Complex Detection

16.3.2.2 Markov Clustering

16.3.2.3 Neighborhood Search Methods

16.3.2.4 Clique Percolation Method

16.3.2.5 Ensemble Clustering

16.3.2.6 Other Clustering Methods

16.3.3 Cluster Validation and Challenges

16.4 Biological Sequence Clustering

16.4.1 Sequence Similarity Metrics

16.4.1.1 Alignment-Based Similarity

16.4.1.2 Keyword-Based Similarity

16.4.1.3 Kernel-Based Similarity

16.4.1.4 Model-Based Similarity

16.4.2 Sequence Clustering Algorithms

16.4.2.1 Subsequence-Based Clustering

16.4.2.2 Graph-Based Clustering

16.4.2.3 Probabilistic Models

16.4.2.4 Suffix Tree and Suffix Array-Based Method

16.5 Software Packages

16.6 Discussion and Summary


17 Network Clustering

Srinivasan Parthasarathy and S M Faisal

17.1 Introduction

17.2 Background and Nomenclature

17.3 Problem Definition

17.4 Common Evaluation Criteria

17.5 Partitioning with Geometric Information

17.5.1 Coordinate Bisection

17.5.2 Inertial Bisection

17.5.3 Geometric Partitioning

17.6 Graph Growing and Greedy Algorithms

17.6.1 Kernighan-Lin Algorithm

17.7 Agglomerative and Divisive Clustering

17.8 Spectral Clustering

17.8.1 Similarity Graphs

17.8.2 Types of Similarity Graphs

17.8.3 Graph Laplacians

17.8.3.1 Unnormalized Graph Laplacian

17.8.3.2 Normalized Graph Laplacians

17.8.4 Spectral Clustering Algorithms

17.9 Markov Clustering

17.9.1 Regularized MCL (RMCL): Improvement over MCL

17.10 Multilevel Partitioning

17.11 Local Partitioning Algorithms

17.12 Hypergraph Partitioning

17.13 Emerging Methods for Partitioning Special Graphs

17.13.1 Bipartite Graphs

17.13.2 Dynamic Graphs

17.13.3 Heterogeneous Networks

17.13.4 Directed Networks

17.13.5 Combining Content and Relationship Information

17.13.6 Networks with Overlapping Communities

17.13.7 Probabilistic Methods

17.14 Conclusion

18 A Survey of Uncertain Data Clustering Algorithms

Charu C. Aggarwal

18.1 Introduction

18.2 Mixture Model Clustering of Uncertain Data

18.3 Density-Based Clustering Algorithms

18.3.1 FDBSCAN Algorithm

18.3.2 FOPTICS Algorithm

18.4 Partitional Clustering Algorithms

18.4.1 The UK-Means Algorithm

18.4.2 The CK-Means Algorithm

18.4.3 Clustering Uncertain Data with Voronoi Diagrams

18.4.4 Approximation Algorithms for Clustering Uncertain Data

18.4.5 Speeding Up Distance Computations

18.5 Clustering Uncertain Data Streams

18.5.1 The UMicro Algorithm

18.5.2 The LuMicro Algorithm


18.5.3 Enhancements to Stream Clustering

18.6 Clustering Uncertain Data in High Dimensionality

18.6.1 Subspace Clustering of Uncertain Data

18.6.2 UPStream: Projected Clustering of Uncertain Data Streams

18.7 Clustering with the Possible Worlds Model

18.8 Clustering Uncertain Graphs

18.9 Conclusions and Summary

19 Concepts of Visual and Interactive Clustering

Alexander Hinneburg

19.1 Introduction

19.2 Direct Visual and Interactive Clustering

19.2.1 Scatterplots

19.2.2 Parallel Coordinates

19.2.3 Discussion

19.3 Visual Interactive Steering of Clustering

19.3.1 Visual Assessment of Convergence of Clustering Algorithm

19.3.2 Interactive Hierarchical Clustering

19.3.3 Visual Clustering with SOMs

19.3.4 Discussion

19.4 Interactive Comparison and Combination of Clusterings

19.4.1 Space of Clusterings

19.4.2 Visualization

19.4.3 Discussion

19.5 Visualization of Clusters for Sense-Making

19.6 Summary

20 Semisupervised Clustering

Amrudin Agovic and Arindam Banerjee

20.1 Introduction

20.2 Clustering with Pointwise and Pairwise Semisupervision

20.2.1 Semisupervised Clustering Based on Seeding

20.2.2 Semisupervised Clustering Based on Pairwise Constraints

20.2.3 Active Learning for Semisupervised Clustering

20.2.4 Semisupervised Clustering Based on User Feedback

20.2.5 Semisupervised Clustering Based on Nonnegative Matrix Factorization

20.3 Semisupervised Graph Cuts

20.3.1 Semisupervised Unnormalized Cut

20.3.2 Semisupervised Ratio Cut

20.3.3 Semisupervised Normalized Cut

20.4 A Unified View of Label Propagation

20.4.1 Generalized Label Propagation

20.4.2 Gaussian Fields

20.4.3 Tikhonov Regularization (TIKREG)

20.4.4 Local and Global Consistency

20.4.5 Related Methods

20.4.5.1 Cluster Kernels

20.4.5.2 Gaussian Random Walks EM (GWEM)

20.4.5.3 Linear Neighborhood Propagation

20.4.6 Label Propagation and Green's Function

20.4.7 Label Propagation and Semisupervised Graph Cuts


20.5 Semisupervised Embedding

20.5.1 Nonlinear Manifold Embedding

20.5.2 Semisupervised Embedding

20.5.2.1 Unconstrained Semisupervised Embedding

20.5.2.2 Constrained Semisupervised Embedding

20.6 Comparative Experimental Analysis

20.6.1 Experimental Results

20.6.2 Semisupervised Embedding Methods

20.7 Conclusions

21 Alternative Clustering Analysis: A Review

James Bailey

21.1 Introduction

21.2 Technical Preliminaries

21.3 Multiple Clustering Analysis Using Alternative Clusterings

21.3.1 Alternative Clustering Algorithms: A Taxonomy

21.3.2 Unguided Generation

21.3.2.1 Naive

21.3.2.2 Meta Clustering

21.3.2.3 Eigenvectors of the Laplacian Matrix

21.3.2.4 Decorrelated k-Means and Convolutional EM

21.3.2.5 CAMI

21.3.3 Guided Generation with Constraints

21.3.3.1 COALA

21.3.3.2 Constrained Optimization Approach

21.3.3.3 MAXIMUS

21.3.4 Orthogonal Transformation Approaches

21.3.4.1 Orthogonal Views

21.3.4.2 ADFT

21.3.5 Information Theoretic

21.3.5.1 Conditional Information Bottleneck (CIB)

21.3.5.2 Conditional Ensemble Clustering

21.3.5.3 NACI

21.3.5.4 mSC

21.4 Connections to Multiview Clustering and Subspace Clustering

21.5 Future Research Issues

21.6 Summary

22 Cluster Ensembles: Theory and Applications

Joydeep Ghosh and Ayan Acharya

22.1 Introduction

22.2 The Cluster Ensemble Problem

22.3 Measuring Similarity Between Clustering Solutions

22.4 Cluster Ensemble Algorithms

22.4.1 Probabilistic Approaches to Cluster Ensembles

22.4.1.1 A Mixture Model for Cluster Ensembles (MMCE)

22.4.1.2 Bayesian Cluster Ensembles (BCE)

22.4.1.3 Nonparametric Bayesian Cluster Ensembles (NPBCE)

22.4.2 Pairwise Similarity-Based Approaches

22.4.2.1 Methods Based on Ensemble Co-Association Matrix


22.4.2.2 Relating Consensus Clustering to Other Optimization Formulations

22.4.3 Direct Approaches Using Cluster Labels

22.4.3.1 Graph Partitioning

22.4.3.2 Cumulative Voting

22.5 Applications of Consensus Clustering

22.5.1 Gene Expression Data Analysis

22.5.2 Image Segmentation

22.6 Concluding Remarks

23 Clustering Validation Measures

Hui Xiong and Zhongmou Li

23.1 Introduction

23.2 External Clustering Validation Measures

23.2.1 An Overview of External Clustering Validation Measures

23.2.2 Defective Validation Measures

23.2.2.1 K-Means: The Uniform Effect

23.2.2.2 A Necessary Selection Criterion

23.2.2.3 The Cluster Validation Results

23.2.2.4 The Issues with the Defective Measures

23.2.2.5 Improving the Defective Measures

23.2.3 Measure Normalization

23.2.3.1 Normalizing the Measures

23.2.3.2 The DCV Criterion

23.2.3.3 The Effect of Normalization

23.2.4 Measure Properties

23.2.4.1 The Consistency Between Measures

23.2.4.2 Properties of Measures

23.2.4.3 Discussions

23.3 Internal Clustering Validation Measures

23.3.1 An Overview of Internal Clustering Validation Measures

23.3.2 Understanding of Internal Clustering Validation Measures

23.3.2.1 The Impact of Monotonicity

23.3.2.2 The Impact of Noise

23.3.2.3 The Impact of Density

23.3.2.4 The Impact of Subclusters

23.3.2.5 The Impact of Skewed Distributions

23.3.2.6 The Impact of Arbitrary Shapes

23.3.3 Properties of Measures

23.4 Summary

24 Educational and Software Resources for Data Clustering

Charu C. Aggarwal and Chandan K. Reddy

24.1 Introduction

24.2 Educational Resources

24.2.1 Books on Data Clustering

24.2.2 Popular Survey Papers on Data Clustering

24.3 Software for Data Clustering

24.3.1 Free and Open-Source Software

24.3.1.1 General Clustering Software

24.3.1.2 Specialized Clustering Software
