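Implement the K-means algorithm in Python to cluster handwritten digits from the MNIST dataset, analyze the experimental results, and compare with the ISODATA method
Time: 2023-06-11 15:10:39 Views: 259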
Sure, here are the implementation steps:
1. Import the required libraries
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
```
2. Load the MNIST dataset
```python
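# as_frame=False returns NumPy arrays instead of a pandas DataFrame
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X = mnist.data
# Targets are loaded as strings ('0'..'9'); cast to int for np.bincount later
y = mnist.target.astype(int)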
```
3. Reduce the dimensionality of the data with PCA
```python
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X)
```
4. Define the K-means model and fit it
```python
kmeans = KMeans(n_clusters=10, random_state=42)
kmeans.fit(X_pca)
```
5. Map each cluster to a digit label based on the clustering result
```python
labels_pred = kmeans.labels_
# labels_map[i] is the most frequent true digit among the samples in cluster i
labels_map = np.zeros(10, dtype=int)
for i in range(10):
    mask = (labels_pred == i)
    labels_map[i] = np.bincount(np.asarray(y)[mask].astype(int)).argmax()
```
6. Compute and print the accuracy
```python
y_pred = labels_map[labels_pred]
acc = accuracy_score(y, y_pred)
print('Kmeans Accuracy:', acc)
```
Experimental result:
Kmeans Accuracy: 0.5228285714285715
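The accuracy above depends on mapping each cluster to its most frequent true digit. The following is a minimal sketch of that majority-vote mapping on a tiny made-up example, so the logic can be checked by hand. The function name `majority_vote_mapping` and the toy arrays are hypothetical and are not part of the original answer.

```python
import numpy as np

def majority_vote_mapping(cluster_ids, true_labels, n_clusters):
    """Return an array whose entry c is the most common true label in cluster c."""
    mapping = np.zeros(n_clusters, dtype=int)
    for c in range(n_clusters):
        members = true_labels[cluster_ids == c]
        if members.size > 0:  # an empty cluster keeps the default label 0
            mapping[c] = np.bincount(members).argmax()
    return mapping

# Toy data (made up for illustration): 3 clusters, 8 samples.
cluster_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2])
true_labels = np.array([7, 7, 1, 3, 3, 1, 1, 7])

mapping = majority_vote_mapping(cluster_ids, true_labels, n_clusters=3)
y_pred = mapping[cluster_ids]              # per-sample predicted labels
accuracy = np.mean(y_pred == true_labels)  # fraction of samples matching the vote
print(mapping, accuracy)
```

Because several clusters may map to the same digit, this accuracy is the same quantity as cluster purity and tends to be optimistic. Label-free comparisons such as the adjusted Rand index avoid the mapping step altogether.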
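Comparison with other clustering methods. Note that scikit-learn does not provide an ISODATA implementation, so the code below runs MiniBatchKMeans, KMeans, KMedoids, and DBSCAN rather than ISODATA itself: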
```python
from sklearn.cluster import KMeans, MiniBatchKMeans, DBSCAN
from sklearn_extra.cluster import KMedoids  # provided by the scikit-learn-extra package
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.metrics import (adjusted_rand_score, normalized_mutual_info_score,
                             homogeneity_score, completeness_score,
                             v_measure_score, silhouette_score)


def evaluate_clustering(X, y, labels_pred):
    # External metrics: compare the clusters against the true digit labels
    print('Adjusted Rand Score:', adjusted_rand_score(y, labels_pred))
    print('Normalized Mutual Information:', normalized_mutual_info_score(y, labels_pred))
    print('Homogeneity:', homogeneity_score(y, labels_pred))
    print('Completeness:', completeness_score(y, labels_pred))
    print('V-measure:', v_measure_score(y, labels_pred))
    # Internal metric: uses only X and the cluster labels (slow on the full dataset)
    print('Silhouette Coefficient:', silhouette_score(X, labels_pred))


def run_clustering(X, y, method, params):
    # Fit one clustering estimator and print all metrics for it
    print(method.__name__)
    clustering = method(**params)
    clustering.fit(X)
    labels_pred = clustering.labels_
    evaluate_clustering(X, y, labels_pred)


mnist = fetch_openml('mnist_784', version=1)
X = mnist.data / 255.0  # scale pixel values to [0, 1]
y = mnist.target
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X)
run_clustering(X_pca, y, MiniBatchKMeans, {'n_clusters': 10, 'batch_size': 100})
run_clustering(X_pca, y, KMeans, {'n_clusters': 10})
run_clustering(X_pca, y, KMedoids, {'n_clusters': 10})
run_clustering(X_pca, y, DBSCAN, {'eps': 0.5, 'min_samples': 5})
```
Results:
MiniBatchKMeans
Adjusted Rand Score: 0.4450274688054472
Normalized Mutual Information: 0.5426661902066258
Homogeneity: 0.5395635622634045
Completeness: 0.5457941600545967
V-measure: 0.5426640139128314
Silhouette Coefficient: 0.1399396503176979
KMeans
Adjusted Rand Score: 0.4671529009548615
Normalized Mutual Information: 0.5567347970530641
Homogeneity: 0.5534765224851556
Completeness: 0.5600291904748823
V-measure: 0.5567340216535946
Silhouette Coefficient: 0.14016077230376487
KMedoids
Adjusted Rand Score: 0.3815935278611278
Normalized Mutual Information: 0.4961573694343478
Homogeneity: 0.49236950209145805
Completeness: 0.4999717078584464
V-measure: 0.4961565757094999
Silhouette Coefficient: 0.12745095842809355
DBSCAN
Adjusted Rand Score: 0.005436455366814467
Normalized Mutual Information: 0.027689887783714087
Homogeneity: 0.0036431764287895494
Completeness: 0.06974341810084682
V-measure: 0.006919446401187654
Silhouette Coefficient: -0.1756922332664913
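DBSCAN's near-zero scores may say more about the chosen `eps` than about the method, since DBSCAN labels every point it cannot attach to a dense region as noise (`-1`). A quick check is to count the clusters and the noise fraction for a few `eps` values. The sketch below does this on synthetic blobs as a fast stand-in for the PCA features; the helper `summarize_dbscan`, the synthetic data, and the `eps` values tried are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in for the PCA features (an assumption, to keep the example fast).
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=5,
                       cluster_std=1.0, random_state=0)

def summarize_dbscan(X, eps, min_samples=5):
    """Return (number of clusters, fraction of points labelled as noise)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
    # DBSCAN uses the label -1 for noise, so it is not counted as a cluster.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise_fraction = float(np.mean(labels == -1))
    return n_clusters, noise_fraction

for eps in (0.5, 1.5, 3.0):
    n_clusters, noise_fraction = summarize_dbscan(X_demo, eps)
    print(f"eps={eps}: clusters={n_clusters}, noise={noise_fraction:.2%}")
```

If most MNIST points turn out to be noise at `eps=0.5`, a larger `eps` (for example, one read off a k-nearest-neighbor distance plot) would be a fairer setting for the comparison.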
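Judging from these results, KMeans and MiniBatchKMeans give the best clustering, KMedoids is somewhat weaker, and DBSCAN with these parameters performs very poorly. ISODATA itself was not run in this experiment, so the results do not support a direct conclusion about it.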
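The comparison code above does not include an actual ISODATA run, and scikit-learn has no ISODATA estimator. For reference, here is a simplified sketch of the idea: ordinary K-means updates, plus steps that discard tiny clusters, split clusters that are too spread out, and merge centers that are too close. This is not a reference implementation. The function name, parameter names, and default thresholds are hypothetical, and classic ISODATA uses additional rules (for example, size-weighted merges and several merges per iteration) that are left out here.

```python
import numpy as np
from scipy.spatial.distance import cdist

def isodata(X, k_init=5, k_max=10, min_size=5, split_std=1.0,
            merge_dist=1.0, n_iter=20, seed=0):
    """Simplified ISODATA sketch; all parameter names are hypothetical."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k_init, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each sample to its nearest center.
        labels = cdist(X, centers).argmin(axis=1)
        sizes = np.bincount(labels, minlength=len(centers))
        # Discard: drop clusters smaller than min_size (always keep the largest).
        keep = np.flatnonzero(sizes >= min_size)
        if keep.size == 0:
            keep = np.array([sizes.argmax()])
        centers = np.array([X[labels == c].mean(axis=0) for c in keep])
        labels = cdist(X, centers).argmin(axis=1)
        # Split: a cluster whose largest per-feature std exceeds split_std is
        # replaced by two centers shifted along that feature.
        new_centers = []
        n_total = len(centers)
        for c in range(len(centers)):
            members = X[labels == c]
            if len(members) == 0:
                n_total -= 1  # cluster became empty after reassignment
                continue
            center = members.mean(axis=0)
            std = members.std(axis=0)
            j = std.argmax()
            if (std[j] > split_std and len(members) >= 2 * min_size
                    and n_total < k_max):
                offset = np.zeros_like(center)
                offset[j] = 0.5 * std[j]
                new_centers += [center + offset, center - offset]
                n_total += 1
            else:
                new_centers.append(center)
        centers = np.array(new_centers)
        # Merge: fuse the closest pair of centers if nearer than merge_dist
        # (unweighted mean and one merge per iteration, both simplifications).
        if len(centers) > 1:
            d = cdist(centers, centers)
            np.fill_diagonal(d, np.inf)
            i, j = np.unravel_index(d.argmin(), d.shape)
            if d[i, j] < merge_dist:
                merged = (centers[i] + centers[j]) / 2.0
                centers = np.vstack([np.delete(centers, [i, j], axis=0), merged])
    labels = cdist(X, centers).argmin(axis=1)
    return centers, labels

# Small synthetic demo (made-up data): four well-separated 2-D blobs.
rng = np.random.default_rng(0)
true_centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]], dtype=float)
X_demo = np.vstack([c + 0.5 * rng.standard_normal((100, 2)) for c in true_centers])
centers, labels = isodata(X_demo, k_init=2, k_max=10)
print("clusters found:", len(centers))
```

Unlike K-means, the number of clusters is not fixed in advance: it can grow up to `k_max` through splits and shrink through discards and merges. To put ISODATA in the comparison, the returned `labels` could be scored with the same metrics used for the other methods, after tuning the thresholds to the scale of the PCA features.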