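Improve the k-means clustering algorithm in Python based on energy distance and apply it to lung cancer gene data, clustering into three classes; plot the clustering result and the accuracy for sample sizes of 10, 30, 50, 100, 200, 300, and 400 in turn, and describe the data source and the Python implementation
Posted: 2024-05-03 22:21:35 Views: 115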
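An implementation of the improved k-means clustering algorithm based on energy distance:
Data source: the lung cancer gene dataset from the UCI Machine Learning Repository. The code below assumes it has been saved locally as `lung_cancer.csv` with a `label` column, and it needs at least 400 rows for the largest sample size.
Code implementation:
First, import the required libraries: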
```python
import numpy as np
import pandas as pd
import random
import math
import matplotlib.pyplot as plt
```
Next, read the lung cancer gene dataset:
```python
data = pd.read_csv('lung_cancer.csv')
```
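Then preprocess the data: drop the label column and convert the remaining features to a NumPy array (the labels are read back from the CSV in the final step):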
```python
data = data.drop('label', axis=1)
data = np.array(data)
```
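The distance used later has a single width `sigma` shared by every feature, so its behavior depends on feature scale. A minimal sketch of z-score standardization, assuming all features are numeric; `standardize` is a hypothetical helper that is not part of the original answer, and the toy matrix is made up for illustration.

```python
import numpy as np

def standardize(features):
    """Rescale each column to zero mean and unit variance (hypothetical helper)."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # Constant columns have std 0; divide by 1 instead so they become all zeros.
    std[std == 0] = 1.0
    return (features - mean) / std

# Toy matrix standing in for the feature array; the values are made up.
toy = np.array([[1.0, 100.0, 5.0],
                [2.0, 300.0, 5.0],
                [3.0, 200.0, 5.0]])
scaled = standardize(toy)
```

Whether standardizing helps on the real data is not something the original answer tests; it is only one common way to make a fixed `sigma` comparable across columns.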
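Then implement the distance function. Note that this is the answer's own kernel-based formula, not the standard energy distance of Székely and Rizzo: for each feature it squares the difference between a Gaussian kernel of width `sigma` and one of width `2*sigma`. The result is zero for identical points but also falls back toward zero for very distant points, so it is not monotone in the gap between two points: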
```python
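def energy_distance(x1, x2, sigma):
    """Sum over features of a squared difference of two Gaussian kernels."""
    n = x1.shape[0]
    res = 0.0
    for i in range(n):
        d2 = (x1[i] - x2[i]) ** 2
        # Narrow kernel (sigma) minus wide kernel (2*sigma), squared.
        res += (math.exp(-d2 / (2 * sigma**2)) - math.exp(-d2 / (2 * (2 * sigma)**2))) ** 2
    return res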
```
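To see how this formula behaves as two points move apart, the sketch below evaluates a vectorized equivalent for a single feature at several gaps. `kernel_gap_distance` is a hypothetical name introduced here; the arithmetic is the same as in `energy_distance` above.

```python
import numpy as np

def kernel_gap_distance(x1, x2, sigma):
    """Vectorized form of the per-feature formula used in energy_distance."""
    d2 = (np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)) ** 2
    narrow = np.exp(-d2 / (2 * sigma**2))        # Gaussian kernel with width sigma
    wide = np.exp(-d2 / (2 * (2 * sigma)**2))    # Gaussian kernel with width 2*sigma
    return float(np.sum((narrow - wide) ** 2))

origin = np.zeros(1)
# Evaluate the distance for a single feature at growing gaps from the origin.
gaps = [0.0, 0.5, 2.0, 10.0]
values = [kernel_gap_distance(origin, np.array([g]), sigma=1.0) for g in gaps]
for g, v in zip(gaps, values):
    print(f"gap={g:5.1f}  distance={v:.6f}")
```

The value starts at zero, rises to a peak at a gap of roughly twice `sigma`, and then decays toward zero again. A centroid that is very far from a point can therefore score as "closer" than a nearby one, which is worth keeping in mind when choosing `sigma` relative to the spread of the data.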
Next, implement the modified k-means clustering algorithm:
```python
def k_means_energy(data, k, max_iter, sigma):
    n = data.shape[0]
    # Pick k distinct rows as the initial centroids (no duplicates).
    centroids = data[random.sample(range(n), k)].astype(float)
    cluster = np.zeros(n, dtype=int)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        for j in range(n):
            dists = [energy_distance(data[j], centroids[l], sigma) for l in range(k)]
            cluster[j] = int(np.argmin(dists))
        # Update step: move each centroid to the mean of its members.
        for l in range(k):
            members = data[cluster == l]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[l] = members.mean(axis=0)
    return cluster
```
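The loop above always runs `max_iter` passes, even after the assignments have stopped changing. The sketch below shows one way to stop early and to pass the distance in as an argument so that different dissimilarities can be compared. The names `kmeans_early_stop`, `squared_euclidean`, and the `init_idx` parameter are hypothetical, and the toy data is made up; the demo uses squared Euclidean distance as a baseline rather than the answer's formula.

```python
import numpy as np

def squared_euclidean(a, b):
    """Baseline dissimilarity; any function of two vectors can be passed instead."""
    return float(np.sum((a - b) ** 2))

def kmeans_early_stop(data, k, dist, init_idx, max_iter=100):
    """Lloyd-style loop that stops once assignments no longer change.

    init_idx lists the rows used as starting centroids (hypothetical parameter,
    added here so the toy run below is deterministic).
    """
    data = np.asarray(data, dtype=float)
    centroids = data[list(init_idx)].copy()
    labels = np.full(len(data), -1)
    for iteration in range(1, max_iter + 1):
        # Assignment step: index of the closest centroid for every point.
        new_labels = np.array([
            int(np.argmin([dist(x, c) for c in centroids])) for x in data
        ])
        if np.array_equal(new_labels, labels):
            break  # converged: another update would not move anything
        labels = new_labels
        # Update step: mean of each non-empty cluster.
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids, iteration

# Three tight, well-separated toy groups of four points each.
offsets = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]])
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy = np.vstack([c + offsets for c in centers])
labels, centroids, n_iter = kmeans_early_stop(
    toy, k=3, dist=squared_euclidean, init_idx=[0, 4, 8]
)
```

Passing `dist=lambda a, b: energy_distance(a, b, sigma)` would run the same loop with the answer's formula, which makes it straightforward to compare the two on the same sample.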
Finally, implement functions that plot the clustering result and compute the accuracy:
```python
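def plot_cluster_result(cluster, data, k):
    # 'w' (white) is left out because it is invisible on the default background.
    colors = ['r', 'g', 'b', 'c', 'm', 'y', 'k']
    for i in range(k):
        # Only the first two feature columns are drawn.
        plt.scatter(data[cluster==i,0], data[cluster==i,1], c=colors[i % len(colors)], s=10)
    plt.show()

def calculate_accuracy(cluster, true_labels):
    # Compares cluster ids to label values directly, so it only makes sense
    # when the ids happen to line up with the labels; k-means ids are arbitrary.
    n = cluster.shape[0]
    correct = 0
    for i in range(n):
        if cluster[i] == true_labels[i]:
            correct += 1
    return correct/n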
```
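Cluster ids produced by k-means are arbitrary, so comparing them to the labels by direct equality can report a low accuracy even for a perfect partition. One common workaround, sketched below, is to try every mapping of cluster ids onto classes and keep the best match. `aligned_accuracy` is a hypothetical helper that is not in the original answer; the brute-force search over permutations is only practical for small `k`, such as the three clusters used here.

```python
import numpy as np
from itertools import permutations

def aligned_accuracy(cluster, true_labels, k):
    """Best accuracy over all one-to-one mappings of cluster ids to classes."""
    cluster = np.asarray(cluster).astype(int)
    true_labels = np.asarray(true_labels).ravel()
    classes, class_idx = np.unique(true_labels, return_inverse=True)
    # counts[i, j] = number of points in cluster i whose true class is classes[j]
    counts = np.zeros((k, max(k, len(classes))), dtype=int)
    np.add.at(counts, (cluster, class_idx), 1)
    best = 0
    # perm[i] is the class column matched to cluster id i.
    for perm in permutations(range(counts.shape[1]), k):
        best = max(best, sum(counts[i, perm[i]] for i in range(k)))
    return best / len(cluster)

# A perfect partition whose ids are shuffled relative to the labels.
shuffled = aligned_accuracy([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1], k=3)
# One point of class 'b' lands in the cluster dominated by class 'a'.
partial = aligned_accuracy([0, 0, 0, 1, 1, 2], ['a', 'a', 'b', 'b', 'b', 'c'], k=3)
```

In the first toy case a direct equality check would score zero, while the aligned score recognizes that the partition is identical up to renaming.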
These functions can then be used to run the clustering, plot the results, and compute the accuracy for each sample size:
```python
true_labels = pd.read_csv('lung_cancer.csv')['label']
true_labels = np.array(true_labels)
k = 3
max_iter = 100
sigma = 1
sample_sizes = [10, 30, 50, 100, 200, 300, 400]
accuracies = []
for sample_size in sample_sizes:
    # random.sample raises ValueError if sample_size exceeds the number of rows.
    indices = random.sample(range(data.shape[0]), sample_size)
    sample_data = data[indices]
    cluster = k_means_energy(sample_data, k, max_iter, sigma)
    plot_cluster_result(cluster, sample_data, k)
    accuracy = calculate_accuracy(cluster, true_labels[indices])
    accuracies.append(accuracy)
    print('Sample size:', sample_size, 'Accuracy:', accuracy)
# Accuracy as a function of sample size.
plt.plot(sample_sizes, accuracies)
plt.xlabel('Sample size')
plt.ylabel('Accuracy')
plt.show()
```
This produces a cluster scatter plot for each sample size and a curve of accuracy against sample size.