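Implement the K-means algorithm in code (submit the source code as well) and report test results on data (test at least 5 datasets; UCI datasets are the recommended source)
Posted: 2024-12-13 14:15:01 Views: 14
Implementing the K-means clustering algorithm usually involves the following steps:
1. Import the required libraries: NumPy for numerical computation, pandas for data handling, and scikit-learn for dataset loaders and preprocessing utilities.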
```python
import numpy as np
from sklearn.datasets import load_iris, make_blobs, load_digits, fetch_openml
import pandas as pd
```
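2. Load and preprocess the data:
- UCI datasets: `load_iris()` and `load_digits()` return scikit-learn's bundled copies of UCI datasets. `make_blobs()` generates synthetic data rather than a UCI dataset, so it is useful mainly as a sanity check.
- OpenML datasets: `fetch_openml(data_id=your_data_id)`, replacing `your_data_id` with the ID of the dataset you need.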
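```python
# Map each extra dataset name to its OpenML data ID (fill in your own),
# e.g. {'your_data_name': your_data_id}.
openml_ids = {}

def load_and_preprocess_data(name):
    """Return the feature matrix X of the named dataset as a float array."""
    if name == 'iris':
        X, _ = load_iris(return_X_y=True)
    elif name == 'blobs':
        X, _ = make_blobs(n_samples=100, centers=3)
    elif name == 'digits':
        X, _ = load_digits(return_X_y=True)
    # more datasets...
    else:
        # Any other name must have an entry in openml_ids first
        X, _ = fetch_openml(data_id=openml_ids[name], return_X_y=True, as_frame=False)
    return np.asarray(X, dtype=float)

# Test datasets; replace the two placeholder names with real ones
data_sets = ['iris', 'blobs', 'digits', 'your_data_name', 'another_data_name']
test_datasets = [load_and_preprocess_data(ds) for ds in data_sets]
```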
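The step above is called "load and preprocess", but the loader only returns raw features. Because K-means relies on Euclidean distance, features with large numeric ranges can dominate the result. Below is a minimal sketch of a standardization step, assuming zero-mean, unit-variance scaling suits your datasets. The helper name `standardize` and the demo matrix are hypothetical, not part of the original answer; scikit-learn's `StandardScaler` performs an equivalent transformation.

```python
import numpy as np

def standardize(X):
    """Scale every feature to zero mean and unit variance (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Constant columns have zero spread; divide by 1 so they become 0 instead of NaN
    std[std == 0] = 1.0
    return (X - mean) / std

# Tiny illustrative matrix, not taken from any real dataset
X_demo = np.array([[1.0, 100.0, 5.0],
                   [2.0, 200.0, 5.0],
                   [3.0, 300.0, 5.0]])
X_scaled = standardize(X_demo)
```

If you adopt this, you could apply it inside `load_and_preprocess_data` before returning `X`, so every dataset goes through the same scaling.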
3. Core implementation of the K-means algorithm:
```python
def kmeans(X, k, max_iter=300):
    X = np.asarray(X, dtype=float)
    centroids = init_centroids(X, k)  # initialize the centroids
    labels = None
    for _ in range(max_iter):
        # Assign each sample to its nearest centroid
        labels = assign_clusters(X, centroids)
        # Recompute the centroids
        new_centroids = update_centroids(X, labels, k)
        # Stop once the centroids no longer move
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Centroid initialization (simplified here): pick k distinct samples at random
def init_centroids(X, k):
    rand_indices = np.random.choice(len(X), k, replace=False)
    centroids = X[rand_indices]
    return centroids

# Recompute each centroid as the mean of the samples assigned to it
def update_centroids(X, labels, k):
    cluster_counts = np.bincount(labels, minlength=k)
    new_centroids = []
    for i in range(k):
        if cluster_counts[i] == 0:
            # Empty cluster: reseed with a random sample to avoid a NaN mean
            new_centroids.append(X[np.random.randint(len(X))])
        else:
            new_centroids.append(np.mean(X[labels == i], axis=0))
    return np.array(new_centroids, dtype=float)

# Assign each sample to the index of its nearest centroid (Euclidean distance)
def assign_clusters(X, centroids):
    distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=-1)
    return np.argmin(distances, axis=1)
```
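The initialization above is described as simplified: it picks k samples uniformly at random, which can place several starting centroids in the same cluster. A common alternative is k-means++ style seeding, sketched below. This is a swapped-in technique, not the original answer's method; the function name `init_centroids_pp`, the `seed` parameter, and the demo data are all assumptions made for illustration.

```python
import numpy as np

def init_centroids_pp(X, k, seed=None):
    """k-means++ style seeding (hypothetical alternative to init_centroids)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # First centroid: one sample chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Squared distance from each sample to its nearest already-chosen centroid
        diffs = X[:, np.newaxis, :] - np.array(centroids)[np.newaxis, :, :]
        d2 = np.min(np.sum(diffs ** 2, axis=-1), axis=1)
        total = d2.sum()
        # Sample proportionally to d2; fall back to uniform if every distance is zero
        probs = d2 / total if total > 0 else np.full(len(X), 1.0 / len(X))
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Two well-separated pairs of points, made up for illustration
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
seeds = init_centroids_pp(X_demo, k=2, seed=0)
```

Since it takes `X` and `k` like `init_centroids`, it could be dropped into `kmeans` by changing the one line that calls the initializer.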
4. Run the tests and print the results:
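```python
k = 3  # number of clusters used for every dataset
# Pair each array with its name, since NumPy arrays carry no name attribute
for name, dataset in zip(data_sets, test_datasets):
    labels, centroids = kmeans(dataset, k=k)
    print(f"Dataset: {name}, Clusters: {list(range(k))}")
    print("Labels:", labels[:10])  # cluster labels of the first few samples
    print("Centroids:", centroids)
```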
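Printing labels and centroids shows what the algorithm produced, but a test report usually also needs a number that can be compared across runs. One option is the within-cluster sum of squared distances (often called inertia), sketched below. The helper name `inertia` and the hand-made example data are assumptions for illustration; lower values mean tighter clusters for a fixed k.

```python
import numpy as np

def inertia(X, labels, centroids):
    """Sum of squared distances from each sample to its assigned centroid."""
    X = np.asarray(X, dtype=float)
    # centroids[labels] picks, for every sample, the centroid of its cluster
    return float(np.sum((X - centroids[labels]) ** 2))

# Hand-made example: every point lies exactly 1 unit from its centroid
X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels_demo = np.array([0, 0, 1, 1])
centroids_demo = np.array([[0.0, 1.0], [10.0, 1.0]])
print(inertia(X_demo, labels_demo, centroids_demo))  # 4.0
```

In the test loop, calling it with each dataset together with the `labels` and `centroids` returned by `kmeans` would give one value per dataset to include in the results.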