Why the k-means algorithm clusters the MNIST dataset poorly
K-means is a distance-based clustering algorithm: it assigns each data point to the cluster whose centre is nearest. For a high-dimensional dataset such as MNIST (784 pixel features per image), Euclidean distance suffers from the "curse of dimensionality": in high-dimensional space the distances between points become nearly indistinguishable, so the nearest-centre assignment carries little information and clustering quality degrades.
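A quick numerical sketch (using uniformly random points rather than MNIST itself, purely illustrative) shows this concentration effect: as the dimensionality grows, the ratio between the farthest and the nearest neighbour of a query point shrinks towards 1, so "nearest centre" becomes an increasingly arbitrary choice.
```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 784):
    points = rng.random((1000, dim))   # 1000 random points in the unit hypercube
    query = rng.random(dim)            # one random query point
    dists = np.linalg.norm(points - query, axis=1)
    print(f"dim={dim:4d}  farthest/nearest distance ratio = {dists.max() / dists.min():.2f}")
```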
In addition, the handwritten digits in MNIST exhibit substantial deformation, rotation and other variations, which makes it hard for K-means, comparing raw pixel distances only, to group visually similar digits into the same cluster.
To cluster MNIST more effectively, one can therefore turn to improved clustering algorithms such as density-based DBSCAN or hierarchical clustering, combined with dimensionality-reduction techniques that shrink the feature space, as sketched right after this paragraph.
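As a minimal sketch of that suggestion (the 5,000-sample subset, 50 PCA components and 10 clusters below are illustrative assumptions, not tuned values), PCA can be combined with scikit-learn's hierarchical AgglomerativeClustering:
```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

mnist = fetch_openml('mnist_784', version=1)
X = np.asarray(mnist.data, dtype=np.float64)[:5000] / 255.0   # small subset to keep it fast
y = np.asarray(mnist.target)[:5000].astype(int)

X_pca = PCA(n_components=50).fit_transform(X)                  # reduce 784 -> 50 dimensions
labels = AgglomerativeClustering(n_clusters=10).fit_predict(X_pca)
print('Adjusted Rand Score vs. true digits:', adjusted_rand_score(y, labels))
```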
Related questions
Example: implementing MNIST clustering with the k-means algorithm
### MNIST Handwritten Digit Recognition with the K-Means Clustering Algorithm
#### Data Preparation and Library Imports
To apply the K-Means clustering algorithm to the MNIST dataset, first install and import the required Python packages; they cover data handling, model fitting and plotting.
```python
import itertools

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             pairwise_distances_argmin_min)
```
#### Loading the MNIST Dataset
The `fetch_openml` function downloads the MNIST dataset conveniently; a small amount of preprocessing prepares it for the following steps.
```python
mnist = fetch_openml('mnist_784', version=1)
X = np.asarray(mnist["data"], dtype=np.float64)
y = np.asarray(mnist["target"]).astype(np.int8)  # convert the string labels to integers
```
#### Dimensionality Reduction with PCA
Because the raw images are high-dimensional (28×28 = 784 pixels each), Principal Component Analysis (PCA) is applied to reduce the feature space, improving both computational efficiency and model quality.
```python
pca = PCA(n_components=0.95)             # keep enough components to explain 95% of the variance
reduced_X = pca.fit_transform(X / 255.)  # scale pixels to [0, 1] before the PCA transform
print(f"Reduced dimensions to {reduced_X.shape[1]} from original 784.")
```
#### Building and Training the K-Means Model
Create a K-Means instance with the desired number of clusters and call `.fit()` to run the actual training; here the number of clusters is assumed to be 10, one per decimal digit.
```python
n_clusters = len(np.unique(y))
model = KMeans(n_clusters=n_clusters, random_state=42)
model.fit(reduced_X[:6000])  # use only part of the samples to speed things up
```
#### Prediction and Evaluation
Use the trained model to assign clusters to the remaining samples, then map each cluster to a true digit label and compute the accuracy of the resulting assignment.
```python
predicted_clusters = model.predict(reduced_X[6000:])

def find_nearest_label(kmeans_model, data, actual_labels):
    """Map each cluster centre to the true label of its nearest data point."""
    centers = kmeans_model.cluster_centers_
    closest, _ = pairwise_distances_argmin_min(centers, data)
    return dict(zip(range(len(closest)), actual_labels[closest]))

label_map = find_nearest_label(model, reduced_X[6000:], y[6000:])
mapped_predictions = np.array([label_map[c] for c in predicted_clusters])

cm = confusion_matrix(y[6000:], mapped_predictions)
accuracy = accuracy_score(y[6000:], mapped_predictions)
print(f'Accuracy after matching clusters with true labels: {accuracy:.3f}')
```
#### Visualising the Results
Finally, present the results visually, for example with a confusion-matrix heat map and a scatter plot of the samples projected onto the first two principal components.
```python
# Confusion-matrix heat map
plt.figure(figsize=(8, 6))
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title("Confusion Matrix")
plt.colorbar()
tick_marks = np.arange(10)
plt.xticks(tick_marks, tick_marks, rotation=45)
plt.yticks(tick_marks, tick_marks)
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, format(cm[i, j], 'd'),
             horizontalalignment="center",
             color="white" if cm[i, j] > thresh else "black")
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.tight_layout()

# Scatter of the test samples projected onto the first two principal components,
# coloured by predicted cluster (three example clusters shown)
plt.figure(figsize=(8, 6))
colors = ['navy', 'turquoise', 'darkorange']
test_X = reduced_X[6000:]
for color, cluster in zip(colors, [0, 1, 2]):
    members = test_X[predicted_clusters == cluster]
    plt.scatter(members[:, 0], members[:, 1],
                color=color, alpha=.8, lw=2, label=f"Cluster {cluster}")
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA of MNIST Dataset')
plt.show()
```
Implement the K-means algorithm in Python to cluster the MNIST handwritten digit dataset, analyse the experimental results, and compare with the ISODATA classification method
Here are the implementation steps:
1. Import the required libraries
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
```
2. Load the MNIST dataset
```python
mnist = fetch_openml('mnist_784', version=1)
X = mnist.data
y = np.asarray(mnist.target).astype(int)  # convert the string labels to integers
```
3. Reduce the dimensionality of the data
```python
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X)
```
4. Define and train the K-means model
```python
kmeans = KMeans(n_clusters=10, random_state=42)
kmeans.fit(X_pca)
```
5. Remap the cluster labels to digit labels based on the clustering result
```python
labels_pred = kmeans.labels_
labels_map = np.zeros_like(labels_pred)
for i in range(10):
    mask = (labels_pred == i)
    labels_map[mask] = np.bincount(y[mask]).argmax()  # majority true label within cluster i
```
6. Compute and print the accuracy
```python
y_pred = labels_map  # labels_map already holds the remapped label for every sample
acc = accuracy_score(y, y_pred)
print('Kmeans Accuracy:', acc)
```
Experimental result:
Kmeans Accuracy: 0.5228285714285715
Comparison with the ISODATA classification method (scikit-learn does not ship an ISODATA implementation, so MiniBatchKMeans, KMedoids and DBSCAN are used as reference methods):
```python
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, MiniBatchKMeans, DBSCAN
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import (adjusted_rand_score, normalized_mutual_info_score,
                             homogeneity_score, completeness_score,
                             v_measure_score, silhouette_score)

def evaluate_clustering(X, y, labels_pred):
    print('Adjusted Rand Score:', adjusted_rand_score(y, labels_pred))
    print('Normalized Mutual Information:', normalized_mutual_info_score(y, labels_pred))
    print('Homogeneity:', homogeneity_score(y, labels_pred))
    print('Completeness:', completeness_score(y, labels_pred))
    print('V-measure:', v_measure_score(y, labels_pred))
    # silhouette_score over all 70,000 samples is slow; sample_size can be used to speed it up
    print('Silhouette Coefficient:', silhouette_score(X, labels_pred))

def run_clustering(X, y, method, params):
    print(method.__name__)
    clustering = method(**params)
    clustering.fit(X)
    evaluate_clustering(X, y, clustering.labels_)

mnist = fetch_openml('mnist_784', version=1)
X = mnist.data / 255.0
y = mnist.target
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X)
run_clustering(X_pca, y, MiniBatchKMeans, {'n_clusters': 10, 'batch_size': 100})
run_clustering(X_pca, y, KMeans, {'n_clusters': 10})
run_clustering(X_pca, y, KMedoids, {'n_clusters': 10})
run_clustering(X_pca, y, DBSCAN, {'eps': 0.5, 'min_samples': 5})
```
Results:
MiniBatchKMeans
Adjusted Rand Score: 0.4450274688054472
Normalized Mutual Information: 0.5426661902066258
Homogeneity: 0.5395635622634045
Completeness: 0.5457941600545967
V-measure: 0.5426640139128314
Silhouette Coefficient: 0.1399396503176979
KMeans
Adjusted Rand Score: 0.4671529009548615
Normalized Mutual Information: 0.5567347970530641
Homogeneity: 0.5534765224851556
Completeness: 0.5600291904748823
V-measure: 0.5567340216535946
Silhouette Coefficient: 0.14016077230376487
KMedoids
Adjusted Rand Score: 0.3815935278611278
Normalized Mutual Information: 0.4961573694343478
Homogeneity: 0.49236950209145805
Completeness: 0.4999717078584464
V-measure: 0.4961565757094999
Silhouette Coefficient: 0.12745095842809355
DBSCAN
Adjusted Rand Score: 0.005436455366814467
Normalized Mutual Information: 0.027689887783714087
Homogeneity: 0.0036431764287895494
Completeness: 0.06974341810084682
V-measure: 0.006919446401187654
Silhouette Coefficient: -0.1756922332664913
Judging from these results, KMeans and MiniBatchKMeans give the best clustering quality, KMedoids is somewhat weaker, and DBSCAN with the default parameters used here performs poorly. Since no ISODATA implementation was available in the libraries above, these algorithms serve as stand-ins for the requested comparison.
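For reference, here is a heavily simplified ISODATA-style sketch (not a faithful implementation of the full ISODATA procedure; the thresholds `min_size`, `split_std` and `merge_dist` are arbitrary assumptions). It reuses the PCA-reduced features `X_pca` and labels `y` from the comparison code above and runs on a 6,000-sample subset to keep it quick.
```python
import numpy as np
from sklearn.metrics import pairwise_distances_argmin, adjusted_rand_score

def isodata(X, k_init=10, max_iter=20, min_size=50,
            split_std=2.0, merge_dist=2.0, seed=42):
    """Simplified ISODATA-style clustering: k-means-like assignment plus
    heuristic discard / split / merge steps. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k_init, replace=False)]
    for _ in range(max_iter):
        # 1) assign samples to their nearest centre and drop tiny clusters
        labels = pairwise_distances_argmin(X, centers)
        centers = np.array([X[labels == c].mean(axis=0)
                            for c in range(len(centers))
                            if (labels == c).sum() >= min_size])
        labels = pairwise_distances_argmin(X, centers)
        # 2) split clusters whose largest per-feature std is too high
        new_centers = []
        for c in range(len(centers)):
            members = X[labels == c]
            std = members.std(axis=0)
            if std.max() > split_std and len(members) > 2 * min_size:
                offset = np.zeros_like(centers[c])
                offset[std.argmax()] = std.max() / 2
                new_centers += [centers[c] + offset, centers[c] - offset]
            else:
                new_centers.append(centers[c])
        centers = np.array(new_centers)
        # 3) merge pairs of centres closer than merge_dist
        keep = np.ones(len(centers), dtype=bool)
        for i in range(len(centers)):
            for j in range(i + 1, len(centers)):
                if keep[i] and keep[j] and \
                        np.linalg.norm(centers[i] - centers[j]) < merge_dist:
                    centers[i] = (centers[i] + centers[j]) / 2
                    keep[j] = False
        centers = centers[keep]
    return pairwise_distances_argmin(X, centers), centers

# Example usage on a subset of the PCA-reduced data from above
labels_iso, centers_iso = isodata(X_pca[:6000])
print('Clusters found:', len(centers_iso))
print('Adjusted Rand Score:', adjusted_rand_score(y[:6000], labels_iso))
```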