Use the KMeans algorithm to partition the existing data set file "./testSet.txt" into clusters (try different values of K and compare the results).
Date: 2024-02-11 18:09:24
Sure. To cluster the data set with KMeans, first install two Python libraries: numpy for data processing and matplotlib for visualization. Run the following command in a terminal or command prompt:
```
pip install numpy matplotlib
```
Then the following Python code performs the clustering:
```python
import numpy as np
import matplotlib.pyplot as plt


# Load the data set: one sample per line, values separated by tabs
def loadDataSet(fileName):
    dataMat = []
    with open(fileName) as fr:
        for line in fr:
            if not line.strip():
                continue  # skip blank lines
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))
            dataMat.append(fltLine)
    return dataMat


# Compute the Euclidean distance between two vectors
def distEclud(vecA, vecB):
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))


# Randomly initialize k centroids inside the bounding box of the data
def randCent(dataSet, k):
    n = np.shape(dataSet)[1]
    centroids = np.zeros((k, n))
    for j in range(n):
        minJ = dataSet[:, j].min()
        rangeJ = float(dataSet[:, j].max() - minJ)
        centroids[:, j] = minJ + rangeJ * np.random.rand(k)
    return centroids


# KMeans algorithm
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = np.shape(dataSet)[0]
    # Column 0: assigned cluster index; column 1: squared distance to that centroid
    clusterAssment = np.zeros((m, 2))
    clusterAssment[:, 0] = -1  # no point is assigned yet
    centroids = createCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):
            # Find the centroid closest to point i
            minDist = np.inf
            minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        print("Centroids:\n", centroids)
        # Move each centroid to the mean of the points assigned to it
        for cent in range(k):
            ptsInClust = dataSet[clusterAssment[:, 0] == cent]
            if len(ptsInClust) > 0:  # an empty cluster keeps its previous centroid
                centroids[cent, :] = np.mean(ptsInClust, axis=0)
    return centroids, clusterAssment


# Plot the data set and the clusters
def showCluster(dataSet, k, centroids, clusterAssment):
    fig, ax = plt.subplots()
    ax.set_title("KMeans Cluster")
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
    ax.scatter(dataSet[:, 0], dataSet[:, 1], c=clusterAssment[:, 0])
    ax.scatter(centroids[:, 0], centroids[:, 1], s=100, c="red", marker="+")
    for i in range(k):
        ptsInClust = dataSet[clusterAssment[:, 0] == i]
        if len(ptsInClust) == 0:
            continue  # nothing to outline for an empty cluster
        # Draw the bounding rectangle of the cluster
        xMin, yMin = ptsInClust.min(axis=0)[:2]
        xMax, yMax = ptsInClust.max(axis=0)[:2]
        rect = plt.Rectangle((xMin, yMin), xMax - xMin, yMax - yMin,
                             linewidth=1, edgecolor='r', facecolor='none')
        ax.add_patch(rect)
    plt.show()


# Test the KMeans algorithm
dataMat = np.array(loadDataSet("./testSet.txt"))
k = 4
centroids, clusterAssment = kMeans(dataMat, k)
print("Centroids:\n", centroids)
showCluster(dataMat, k, centroids, clusterAssment)
```
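The code above defines functions for loading the data set (`loadDataSet`), computing the Euclidean distance (`distEclud`), randomly initializing the centroids (`randCent`), running KMeans (`kMeans`), and plotting the data set with its clusters (`showCluster`). The test section then loads the data with `loadDataSet`, clusters it with `kMeans`, and calls `showCluster` to display the points, the centroids, and a bounding rectangle around each cluster in a single figure.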
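The nested loops in `kMeans` that find the nearest centroid for each point can also be expressed with NumPy broadcasting. The following is only a minimal sketch of a single assign-and-update step, not a replacement for the code above; the function name `assign_and_update` and the toy points are made up for illustration.

```python
import numpy as np


def assign_and_update(points, centroids):
    """One KMeans iteration: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its points."""
    # Pairwise squared distances with shape (m, k), built by broadcasting
    diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    labels = np.argmin(sq_dists, axis=1)
    new_centroids = centroids.copy()
    for c in range(centroids.shape[0]):
        members = points[labels == c]
        if len(members) > 0:  # an empty cluster keeps its previous centroid
            new_centroids[c] = members.mean(axis=0)
    return labels, new_centroids


# Hypothetical toy data: two tight groups of two points each
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])
labels, new_centroids = assign_and_update(points, centroids)
print(labels)         # [0 0 1 1]
print(new_centroids)  # rows [0, 0.5] and [10, 10.5]
```

Repeating this step until `labels` stops changing gives the same kind of loop as `kMeans`, with the inner Python loops replaced by array operations.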
You can change the value of `k` in the code, for example to `k=2` or `k=3`, to see how the number of clusters affects the partition.
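One way to compare different values of K numerically is to look at the sum of squared distances (SSE) from each point to its assigned centroid. The sketch below is an assumption-laden illustration: it uses synthetic points rather than the contents of "./testSet.txt", a compact KMeans written only for this example (`kmeans_sse` is a hypothetical helper, not part of the code above), and initial centroids drawn from the data points instead of `randCent`.

```python
import numpy as np


def kmeans_sse(points, k, rng, max_iter=100):
    """Run a basic KMeans and return the sum of squared distances
    from each point to its nearest final centroid."""
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        sq_dists = np.sum((points[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
        new_labels = np.argmin(sq_dists, axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:  # an empty cluster keeps its previous centroid
                centroids[c] = members.mean(axis=0)
    sq_dists = np.sum((points[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    return float(np.sum(np.min(sq_dists, axis=1)))


# Synthetic stand-in data: four blobs of 50 points each (not the real testSet.txt)
rng = np.random.default_rng(0)
centers = np.array([[-3.0, -3.0], [-3.0, 3.0], [3.0, -3.0], [3.0, 3.0]])
points = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in centers])

sse_by_k = {k: kmeans_sse(points, k, rng) for k in range(1, 7)}
for k, sse in sse_by_k.items():
    print(f"k={k}: SSE={sse:.2f}")
```

To apply this to the real data, `points` could be replaced with `np.array(loadDataSet("./testSet.txt"))`. With the `kMeans` function above, summing `clusterAssment[:, 1]` gives the same measure. Because each run starts from random centroids, the values can differ between runs, so it may help to repeat each K a few times before comparing.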