How can I write Python code for the common metrics used to evaluate clustering performance?
Time: 2024-10-29 13:09:12 Views: 57
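In Python, common metrics for evaluating clustering performance include the silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index. All three are internal metrics: they are computed from the feature matrix and the cluster labels alone, with no ground-truth labels required. Higher is better for the silhouette coefficient and the Calinski-Harabasz index; lower is better for the Davies-Bouldin index. The computation of each is given below.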
**1. Silhouette Coefficient**
```python
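import numpy as np
from sklearn.metrics import silhouette_score

def silhouette_coefficient(X, labels):
    # silhouette_score takes the feature matrix and the cluster labels;
    # it does not use ground-truth labels.
    return silhouette_score(np.asarray(X), np.asarray(labels))

# Usage: X is the feature matrix, labels are the predicted cluster labels
silhouette = silhouette_coefficient(X, labels)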
```
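To show what the library call computes, here is a minimal from-scratch sketch using only NumPy. The function name `silhouette_from_scratch` is made up for illustration, and the sketch assumes Euclidean distance, at least two clusters, and the usual convention that a point in a singleton cluster scores 0. It builds the full pairwise distance matrix, so it is only suitable for small datasets.

```python
import numpy as np

def silhouette_from_scratch(X, labels):
    """Mean silhouette over all samples (illustrative helper, Euclidean distance)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    # Full pairwise Euclidean distance matrix, shape (n, n)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # assumed convention: singleton clusters score 0
        # a: mean distance to the other members of the same cluster
        # (the distance to itself is 0, so divide by n_own - 1)
        a = dists[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Two tight, well-separated 1-D clusters
X_demo = [[0.0], [1.0], [10.0], [11.0]]
labels_demo = [0, 0, 1, 1]
print(silhouette_from_scratch(X_demo, labels_demo))  # approximately 0.8997
```

For the first point, a = 1 and b = 10.5, giving (10.5 - 1) / 10.5; values close to 1 indicate compact, well-separated clusters.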
**2. Calinski-Harabasz Index**
```python
import numpy as np

def calinski_harabasz_score(X, labels):
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    n_samples = X.shape[0]
    clusters = np.unique(labels)  # unique cluster labels
    n_clusters = len(clusters)
    overall_mean = X.mean(axis=0)
    within_ss = 0.0   # within-cluster dispersion
    between_ss = 0.0  # between-cluster dispersion
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        within_ss += ((members - centroid) ** 2).sum()
        between_ss += len(members) * ((centroid - overall_mean) ** 2).sum()
    # CH = [between / (k - 1)] / [within / (n - k)]
    return (between_ss / (n_clusters - 1)) / (within_ss / (n_samples - n_clusters))

# Usage: X is the feature matrix, labels are the cluster labels
ch_score = calinski_harabasz_score(X, labels)
```
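For reference, the standard definition of the Calinski-Harabasz index can be written as follows. The symbols are notation chosen here, not taken from the answer: $n$ is the number of samples, $k$ the number of clusters, $C_c$ the set of points in cluster $c$ with size $n_c$ and centroid $\mu_c$, and $\mu$ the mean of all samples.

$$
\mathrm{CH} = \frac{\operatorname{tr}(B_k)/(k-1)}{\operatorname{tr}(W_k)/(n-k)},
\qquad
\operatorname{tr}(B_k) = \sum_{c=1}^{k} n_c \,\lVert \mu_c - \mu \rVert^2,
\qquad
\operatorname{tr}(W_k) = \sum_{c=1}^{k} \sum_{x \in C_c} \lVert x - \mu_c \rVert^2
$$

As a small worked case, the 1-D points 0, 1, 10, 11 split into clusters {0, 1} and {10, 11} give $\operatorname{tr}(B_k) = 2 \cdot 5^2 + 2 \cdot 5^2 = 100$ and $\operatorname{tr}(W_k) = 4 \cdot 0.5^2 = 1$, so $\mathrm{CH} = (100/1)/(1/2) = 200$.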
**3. Davies-Bouldin Index**
```python
import numpy as np
from scipy.spatial.distance import cdist

def davies_bouldin_index(X, labels):
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    clusters = np.unique(labels)
    n_clusters = len(clusters)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # Scatter s_i: mean Euclidean distance from each member to its centroid
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(clusters)
    ])
    # Pairwise Euclidean distances between cluster centroids
    centroid_dists = cdist(centroids, centroids, metric='euclidean')
    worst_ratios = []
    for i in range(n_clusters):
        # Similarity R_ij = (s_i + s_j) / d_ij for every other cluster j
        ratios = [(scatter[i] + scatter[j]) / centroid_dists[i, j]
                  for j in range(n_clusters) if j != i]
        worst_ratios.append(max(ratios))
    # DBI is the average of each cluster's worst-case similarity
    return float(np.mean(worst_ratios))

# Usage is the same as above
davies_bouldin_index(X, labels)
```
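scikit-learn also ships all three metrics in `sklearn.metrics`, which gives a convenient reference when checking hand-written versions. The sketch below is only an illustration: the helper `evaluate_clustering`, the synthetic data from `make_blobs`, and the choice of `KMeans` with three clusters are assumptions made for the example, not part of the original answer. The metrics are accessed through the `metrics` module so they do not clash with a locally defined `calinski_harabasz_score`.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def evaluate_clustering(X, labels):
    """Return the three internal metrics for one clustering (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    return {
        "silhouette": metrics.silhouette_score(X, labels),
        "calinski_harabasz": metrics.calinski_harabasz_score(X, labels),
        "davies_bouldin": metrics.davies_bouldin_score(X, labels),
    }

# Synthetic data: three Gaussian blobs, clustered with k-means
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

scores = evaluate_clustering(X, labels)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Running the same helper for several values of `n_clusters` and comparing the scores is a common way to choose the number of clusters, keeping in mind that the three metrics do not always agree.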