Bisecting K-means and a Python implementation
Date: 2023-10-12 21:19:10
Bisecting K-means is a variant of K-means that builds the clustering top-down by repeated splitting. Because each split is a small, well-initialized 2-means problem, it is far less sensitive to the initial choice of centroids than plain K-means and less likely to get stuck in a poor local optimum. The algorithm proceeds as follows:
1. Start with all samples in a single cluster whose centroid is the mean of the data.
2. Compute the sum of squared errors (SSE) of each current cluster, i.e. the sum of squared distances from its members to its centroid.
3. Select the cluster with the largest SSE and split it into two sub-clusters with standard 2-means (optionally trying several random initializations and keeping the split with the lowest resulting SSE).
4. Repeat steps 2–3 until the preset number of clusters is reached.
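The cluster-selection rule in steps 2–3 can be sketched in a few lines. This is a minimal illustration on hypothetical toy data (the points, labels, and values below are made up for the example):

```python
import numpy as np

# Toy data: four 2-D points already assigned to two clusters (labels 0 and 1).
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [9.0, 9.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == c].mean(axis=0) for c in (0, 1)])

# SSE of each cluster: sum of squared distances of members to their centroid.
sse = np.array([np.sum((X[labels == c] - centroids[c]) ** 2) for c in (0, 1)])
to_split = int(np.argmax(sse))  # the spread-out cluster 1 is chosen for bisection
```

Here cluster 1 has a much larger SSE (its two points are far apart), so it is the one the next bisection step would split.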
A Python implementation of bisecting K-means:
```python
import numpy as np

def euclidean_distance(x1, x2):
    """Euclidean distance between two vectors."""
    return np.sqrt(np.sum((x1 - x2) ** 2))

class BinaryKMeans:
    def __init__(self, k=3, max_iters=10):
        self.k = k                  # target number of clusters
        self.max_iters = max_iters  # iterations for each 2-means run

    def _two_means(self, points):
        """Split one group of points into two sub-clusters with plain 2-means."""
        idx = np.random.choice(len(points), 2, replace=False)
        centroids = points[idx].astype(float).copy()
        for _ in range(self.max_iters):
            dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
            labels = dists.argmin(axis=1)
            for c in (0, 1):
                if np.any(labels == c):  # guard against an empty sub-cluster
                    centroids[c] = points[labels == c].mean(axis=0)
        return centroids, labels

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # Start with every point in a single cluster.
        self.clusters = np.zeros(len(X), dtype=int)
        self.centroids = [X.mean(axis=0)]
        while len(self.centroids) < self.k:
            # Pick the cluster with the largest SSE as the one to split.
            sse = [np.sum((X[self.clusters == c] - self.centroids[c]) ** 2)
                   for c in range(len(self.centroids))]
            target = int(np.argmax(sse))
            member_idx = np.where(self.clusters == target)[0]
            if len(member_idx) < 2:
                break  # nothing left to split
            cents, labels = self._two_means(X[member_idx])
            # One sub-cluster keeps the old label, the other gets a new one.
            new_label = len(self.centroids)
            self.clusters[member_idx[labels == 1]] = new_label
            self.centroids[target] = cents[0]
            self.centroids.append(cents[1])
        return self.clusters
```
Here, the euclidean_distance function computes the Euclidean distance between two vectors. The BinaryKMeans class has two public methods: __init__, which stores the target cluster count k and the per-split iteration limit, and fit, which repeatedly bisects the highest-SSE cluster with 2-means until k clusters exist and returns the cluster label of each sample.
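The core claim behind the algorithm — that bisecting a spread-out cluster reduces total SSE — is easy to check in isolation. Below is a self-contained sketch of one bisection step on synthetic data (the seed, group means, and helper two_means are assumptions for this demo, not part of the original code):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "cluster" that actually contains two tight sub-groups.
points = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])

def two_means(points, iters=10):
    """One bisection: plain 2-means on a single group of points."""
    centroids = points[[0, -1]].astype(float).copy()  # one seed per sub-group
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        for c in (0, 1):
            if np.any(labels == c):
                centroids[c] = points[labels == c].mean(axis=0)
    return centroids, labels

sse_before = np.sum((points - points.mean(axis=0)) ** 2)
cents, labels = two_means(points)
sse_after = sum(np.sum((points[labels == c] - cents[c]) ** 2) for c in (0, 1))
# Splitting the cluster should cut total SSE sharply.
```

This is exactly the quantity step 2 of the algorithm tracks: after the split, the combined SSE of the two sub-clusters is far below the SSE of the original single cluster.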