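Implement in Python: use pandas to read a CSV file in which the olgt column is the origin longitude and the olat column is the origin latitude; plot the sample point distribution; cluster the points with K-means; and, for different choices of k, evaluate the clustering with the silhouette coefficient, the Calinski-Harabasz index, and the elbow method. The clustering function must be written by hand, without calling a library implementation.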
Posted: 2023-12-22 08:02:33 Views: 75
The following is a Python implementation:
```python
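import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read the CSV file (replace 'data.csv' with the actual file name)
data = pd.read_csv('data.csv')

# Origin longitude (olgt) and latitude (olat) as an (n, 2) float array;
# rows with missing coordinates are dropped
X = data[['olgt', 'olat']].dropna().to_numpy(dtype=float)

# Plot the sample point distribution
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()


# Custom K-means clustering function
def kmeans(X, k, max_iterations=100):
    # Randomly pick k distinct samples as the initial centroids
    indices = np.random.choice(len(X), size=k, replace=False)
    centers = X[indices]
    for _ in range(max_iterations):
        # Distance from every point to every centroid, shape (n, k)
        distances = np.linalg.norm(X[:, np.newaxis] - centers, axis=2)
        # Assign each point to the cluster of its nearest centroid
        labels = np.argmin(distances, axis=1)
        # Recompute the centroids; an empty cluster keeps its previous
        # centroid instead of producing NaN
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(centers, new_centers):
            break
        centers = new_centers
    return centers, labels


# Silhouette coefficient (mean over all samples)
def silhouette_coefficient(X, labels):
    n = len(X)
    # Full pairwise distance matrix, shape (n, n); this needs O(n^2) memory
    D = np.linalg.norm(X[:, np.newaxis] - X, axis=2)
    clusters = np.unique(labels)
    a = np.zeros(n)
    b = np.full(n, np.inf)
    singleton = np.zeros(n, dtype=bool)
    for c in clusters:
        in_c = labels == c
        size = in_c.sum()
        if size > 1:
            # a: mean distance to the other members of the same cluster
            # (the point itself is excluded)
            a[in_c] = D[np.ix_(in_c, in_c)].sum(axis=1) / (size - 1)
        else:
            singleton[in_c] = True
        # b: smallest mean distance to the members of any other cluster
        for other in clusters:
            if other != c:
                mean_dist = D[np.ix_(in_c, labels == other)].mean(axis=1)
                b[in_c] = np.minimum(b[in_c], mean_dist)
    s = (b - a) / np.maximum(a, b)
    # A point alone in its cluster gets a score of 0 by convention
    s[singleton] = 0.0
    return np.mean(s)


# Calinski-Harabasz index
def calinski_harabasz(X, labels):
    n = len(X)
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    ssw = 0.0  # within-cluster sum of squares
    ssb = 0.0  # between-cluster sum of squares, weighted by cluster size
    for c in clusters:
        members = X[labels == c]
        center = members.mean(axis=0)
        ssw += np.sum((members - center) ** 2)
        ssb += len(members) * np.sum((center - overall_mean) ** 2)
    return (ssb / (k - 1)) / (ssw / (n - k))


# Cluster the data with K-means for several k and evaluate each result
for k in range(2, 6):
    centers, labels = kmeans(X, k)
    s = silhouette_coefficient(X, labels)
    ch = calinski_harabasz(X, labels)
    print(f'K={k}, Silhouette Coefficient={s:.4f}, Calinski-Harabasz Index={ch:.4f}')
    # Plot the clustering result for this k
    plt.scatter(X[:, 0], X[:, 1], c=labels, s=50)
    plt.scatter(centers[:, 0], centers[:, 1], marker='*', s=200, c='black')
    plt.xlabel('Longitude')
    plt.ylabel('Latitude')
    plt.title(f'K-means clustering, k={k}')
    plt.show()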
```
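For reference, these are the standard textbook definitions of the two scores that `silhouette_coefficient` and `calinski_harabasz` are meant to compute. The symbols ($C_j$ for cluster $j$, $n_j$ for its size, $c_j$ for its centroid, $\bar{x}$ for the overall mean, $d$ for Euclidean distance) are notation introduced here, not taken from the original answer.

$$
\begin{aligned}
a(i) &= \frac{1}{n_{j}-1}\sum_{x \in C_{j},\, x \neq x_i} d(x_i, x), \qquad x_i \in C_j \\
b(i) &= \min_{l \neq j}\; \frac{1}{n_{l}}\sum_{x \in C_{l}} d(x_i, x) \\
s(i) &= \frac{b(i)-a(i)}{\max\{a(i),\, b(i)\}}, \qquad S = \frac{1}{n}\sum_{i=1}^{n} s(i) \\
\mathrm{SSW} &= \sum_{j=1}^{k}\sum_{x \in C_j} \lVert x - c_j \rVert^2, \qquad
\mathrm{SSB} = \sum_{j=1}^{k} n_j\, \lVert c_j - \bar{x} \rVert^2 \\
\mathrm{CH} &= \frac{\mathrm{SSB}/(k-1)}{\mathrm{SSW}/(n-k)}
\end{aligned}
$$

Both scores are "higher is better": the silhouette coefficient lies in $[-1, 1]$, and a larger Calinski-Harabasz index indicates compact, well-separated clusters. Note that each centroid in $\mathrm{SSB}$ is weighted by its cluster size $n_j$.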
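Note: before running the code, change the CSV file name to the actual file name. Running it first plots the sample point distribution, then runs K-means for K = 2 to 5, printing the silhouette coefficient and Calinski-Harabasz index for each K and plotting the clustering result.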
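The question also asks for the elbow method, which the code above does not cover. The following is a minimal sketch of one way to add it, not part of the original answer: it computes the within-cluster sum of squared errors (SSE) for each k and plots it, so the "elbow" can be read off by eye. The names `within_cluster_sse`, `elbow_curve`, `plot_elbow` and the `n_restarts` parameter are hypothetical; `cluster_fn` is assumed to be any function that, like `kmeans(X, k)` above, returns `(centers, labels)`.

```python
import numpy as np


def within_cluster_sse(X, centers, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(np.sum((X - centers[labels]) ** 2))


def elbow_curve(X, k_values, cluster_fn, n_restarts=5):
    """Return the lowest SSE found for each k in k_values.

    cluster_fn(X, k) must return (centers, labels), as kmeans() does.
    n_restarts is an assumed parameter: K-means depends on its random
    initialisation, so each k is run several times and the lowest SSE kept.
    """
    sse = []
    for k in k_values:
        runs = [within_cluster_sse(X, *cluster_fn(X, k)) for _ in range(n_restarts)]
        sse.append(min(runs))
    return sse


def plot_elbow(k_values, sse):
    """Plot SSE against k; the elbow is where the curve stops dropping sharply."""
    # Imported here so the SSE helpers work without a plotting backend
    import matplotlib.pyplot as plt
    plt.plot(list(k_values), sse, marker='o')
    plt.xlabel('Number of clusters k')
    plt.ylabel('Within-cluster SSE')
    plt.title('Elbow method')
    plt.show()


# Usage with the kmeans() function and X array defined earlier:
#     ks = range(1, 11)
#     plot_elbow(ks, elbow_curve(X, ks, kmeans))
```

SSE always decreases as k grows, so the value itself is not a score to maximise or minimise; the k to pick is the one after which adding clusters yields only small further reductions. Comparing that k with the silhouette and Calinski-Harabasz results gives a more reliable choice than any single criterion.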