A Research Survey of Clustering Algorithms for Big Data Mining


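"This PDF article is a comprehensive study of clustering algorithms in the field of data mining. It mainly discusses the characteristics of big data and gives an overview of the different types of clustering algorithms, including partitioning, hierarchical, density-based, grid-based, and model-based methods. The authors, T. Sajana, C. M. Sheela Rani and K. V. Narayana, are from KL University, India. The introduction presents the concept of big data and points out the challenges of processing it in a data mining environment, as well as the limitations of traditional data processing applications."

The core content of the article covers the following aspects:

1. **Characteristics of big data**: Big data refers to collections of large volumes of complex data, so large that conventional data processing tools cannot analyze them effectively. Such data is characterized by a high growth rate, variety, high-velocity generation, and low value density. Processing big data requires new techniques and algorithms to mine hidden patterns and knowledge.
2. **Categories of clustering algorithms**:
   - **Partitioning clustering**: Algorithms such as K-Means and K-Medoids divide the data set into a predetermined number of non-overlapping groups, and each data point belongs to exactly one group.
   - **Hierarchical clustering**: Includes agglomerative and divisive methods, such as single linkage, complete linkage, and average linkage, which build a tree structure to represent the similarity between data items.
   - **Density-based clustering**: Algorithms such as DBSCAN and OPTICS define clusters by the density of data points and handle noise and irregularly shaped clusters well.
   - **Grid-based clustering**: Algorithms such as STING and CLIQUE divide the data space into a grid, count the data points in each cell, and find the high-density regions.
   - **Model-based clustering**: Algorithms such as EM (expectation maximization) cluster on the basis of a probabilistic model and can handle data drawn from mixture distributions.
3. **Applications of clustering in big data mining**: Clustering is one of the key techniques of data mining, especially in big data environments, where it is used to discover the natural group structure of the data without knowing class information in advance. It is widely applied in market segmentation, social network analysis, image segmentation, bioinformatics, and many other fields.
4. **Challenges and future trends**: As big data keeps growing, clustering algorithms face challenges in computational efficiency, memory requirements, scalability, and accuracy. Future clustering research may focus on more efficient and adaptable algorithms, as well as ensemble methods that combine several clustering strategies.

The article gives a broad view of the current state and development trends of clustering algorithms in the big data context, and it is an important reference for researchers and practitioners who want to understand how clustering is applied in big data mining.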

Vol 9 (3) | January 2016 | www.indjst.org

Indian Journal of Science and Technology

T. Sajana, C. M. Sheela Rani and K. V. Narayana

further clusters until the desired number of clusters is formed. BIRCH, CURE, ROCK, Chameleon, Echidna, Wards, SNN, GRIDCLUST, and CACTUS are some of the hierarchical clustering algorithms, in which non-convex, arbitrary, and hyper-rectangular clusters are formed.

3.3 Density based Clustering algorithms:

Data objects are categorized into core points, border points, and noise points. All the core points are connected together based on their densities to form a cluster. Arbitrarily shaped clusters are formed by various clustering algorithms such as DBSCAN, OPTICS, DBCLASD, GDBSCAN, DENCLUE, and SUBCLU.
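The split into core, border, and noise points can be illustrated with a minimal sketch in the style of DBSCAN. The function name `classify_points` and the parameters `eps` (neighborhood radius) and `min_pts` (minimum neighborhood size, counting the point itself) are assumptions made for this illustration and are not taken from the paper.

```python
from math import dist

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' in the DBSCAN sense."""
    # Neighborhood of each point: indices within distance eps (the point itself included).
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    # A core point has at least min_pts points in its neighborhood.
    core = {i for i, nbrs in enumerate(neighbors) if len(nbrs) >= min_pts}
    labels = []
    for i, nbrs in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nbrs):
            # Not dense itself, but within eps of a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Hypothetical toy data: a small dense square, one point on its edge, one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (8.0, 8.0)]
labels = classify_points(points, eps=1.1, min_pts=3)
print(labels)
```

The brute-force neighbor search here is quadratic in the number of points. It is meant only to show the definitions, not how a scalable implementation would index the data.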

3.4 Grid based Clustering algorithms:

A grid-based algorithm partitions the data set into a number of cells to form a grid structure, and clusters are formed based on that structure. To form clusters, grid algorithms use subspace and hierarchical clustering techniques. Examples include STING, CLIQUE, WaveCluster, BANG, OptiGrid, MAFIA, ENCLUS, PROCLUS, ORCLUS, FC, and STIRR. Compared with other clustering algorithms, grid algorithms process data very quickly. Uniform grid algorithms are not sufficient to form the desired clusters. To overcome this problem, adaptive grid algorithms such as MAFIA and AMR are used. Arbitrarily shaped clusters are formed by the grid cells.
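As a rough illustration of the uniform-grid idea, the sketch below bins 2-D points into square cells, keeps the cells that hold enough points, and merges adjacent dense cells into clusters. It is not an implementation of STING, CLIQUE, or any other named algorithm. The function name `grid_clusters` and the parameters `cell_size` and `min_count` are hypothetical.

```python
from collections import Counter, deque
from math import floor

def grid_clusters(points, cell_size, min_count):
    """Group 2-D points by connected dense grid cells; returns a list of point lists."""
    # Step 1: map each point to its cell and count the points per cell.
    cell_of = [(floor(x / cell_size), floor(y / cell_size)) for x, y in points]
    counts = Counter(cell_of)
    dense = {c for c, n in counts.items() if n >= min_count}

    # Step 2: merge dense cells that share an edge or a corner (breadth-first search).
    cluster_id = {}
    next_id = 0
    for start in sorted(dense):
        if start in cluster_id:
            continue
        cluster_id[start] = next_id
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in cluster_id:
                        cluster_id[nb] = next_id
                        queue.append(nb)
        next_id += 1

    # Step 3: points in dense cells inherit the cell's cluster; sparse cells are dropped.
    clusters = [[] for _ in range(next_id)]
    for p, c in zip(points, cell_of):
        if c in cluster_id:
            clusters[cluster_id[c]].append(p)
    return clusters

# Hypothetical toy data: two dense regions and one isolated point.
points = [(0.1, 0.2), (0.4, 0.3), (1.2, 0.5), (1.6, 0.9),
          (5.1, 5.2), (5.5, 5.7), (9.0, 0.5)]
clusters = grid_clusters(points, cell_size=1.0, min_count=2)
print(clusters)
```

The work after binning depends on the number of occupied cells rather than on the number of points, which is one reason grid methods are fast. The result is sensitive to `cell_size`, which is the weakness of uniform grids that adaptive grids try to address.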

3.5 Model based Clustering algorithms:

Sets of data points are connected together based on various strategies such as statistical methods, conceptual methods, and robust clustering methods. There are two approaches to model-based algorithms: the neural network approach and the statistical approach. EM, COBWEB, CLASSIT, SOM, and SLINK are well-known model-based clustering algorithms.
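For the statistical approach, a minimal sketch of EM on a two-component, one-dimensional Gaussian mixture may help. The initialization (means at the data extremes, unit variances, equal weights), the fixed iteration count, and the variance floor are assumptions chosen for this illustration. The paper does not prescribe them.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def em_two_gaussians(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns (weights, means, variances), each a list of length 2.
    """
    # Assumed initialization: means at the data extremes, unit variances, equal weights.
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            likes = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(likes)
            resp.append([lk / total for lk in likes])
        # M-step: re-estimate the parameters from the responsibility-weighted data.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a collapsed component
    return weights, means, variances

# Hypothetical toy data: two well-separated groups around 1 and 10.
data = [0.8, 1.0, 1.2, 0.9, 1.1, 9.8, 10.0, 10.2, 9.9, 10.1]
weights, means, variances = em_two_gaussians(data)
print(weights, means, variances)
```

Each point ends up with a soft membership (its responsibility vector) instead of a hard label, which is what distinguishes this model-based view from partitioning methods such as K-Means.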

4. Comparison of Clustering Algorithms

The clustering methods discussed above mine data from Big Data. Every algorithm has its own strengths and weaknesses. This paper compares various clustering algorithms with respect to the 4 V’s of Big Data characteristics.

4.1 Volume:

Refers to the ability of an algorithm to deal with large amounts of data. With respect to the Volume property, the criteria to be considered for clustering algorithms are (a) size of the data set, (b) high dimensionality, and (c) handling outliers.

• Size of the data set: A data set is a collection of attributes. The attributes are categorical, nominal, ordinal, interval, or ratio. Many clustering algorithms support numerical and categorical data.

• High dimensionality: In big data, as the size of the data set increases, the number of dimensions also increases. This is the curse of dimensionality.

• Outliers: Many clustering algorithms are capable of handling outliers. Noise data cannot be grouped with the other data points.

4.2 Variety:

Refers to the ability of a clustering algorithm to handle different types of data sets, such as numerical, categorical, nominal, and ordinal. The criteria for clustering algorithms are (a) type of data set and (b) cluster shape.

• Type of data set: The data set may be small or big, but many clustering algorithms support large data sets for big data mining.

• Cluster shape: The shape of the cluster formed depends on the size and type of the data set.

4.3 Velocity:

Refers to the computations of a clustering algorithm, based on the criterion (a) running time complexity of the clustering algorithm.

• Time complexity: If an algorithm requires very few computations, it has a short run time. The run time of the algorithms is calculated using Big O notation.
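To make the Big O comparison concrete, the sketch below counts distance evaluations for two standard cases: one assignment pass of K-Means, which is O(n·k), and a comparison of every pair of points, as needed for a full pairwise distance matrix, which is O(n²). These counts are general facts about the operations and are not figures from the paper's Table 1. The function names are hypothetical.

```python
def kmeans_pass_distance_count(n, k):
    """Distance evaluations in one assignment pass of K-Means: n * k, i.e. O(n * k)."""
    return n * k

def pairwise_distance_count(n):
    """Distance evaluations to compare every pair of points once: n(n-1)/2, i.e. O(n^2)."""
    return n * (n - 1) // 2

# Compare how the two counts grow as the data set size n increases (k fixed at 10).
for n in (1_000, 1_000_000):
    print(n, kmeans_pass_distance_count(n, k=10), pairwise_distance_count(n))
```

Growing n by a factor of 1,000 multiplies the first count by 1,000 and the second by roughly 1,000,000, which is why quadratic methods become impractical on large data sets.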

4.4 Value:

For a clustering algorithm to process the data accurately and to form clusters with less computation, the input parameters play a key role. The values of various clustering algorithms are given in Table 1.
