Python code for a clustering-based outlier detection method
Date: 2023-12-20 18:04:33
The following Python example shows a clustering-based outlier detection method built on KMeans:
```python
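import numpy as np
from sklearn.cluster import KMeans


def detect_outliers(X, n_clusters=8, contamination=0.05):
    """Return the rows of X flagged as outliers by a KMeans-based score."""
    X = np.asarray(X, dtype=float)
    # Cluster the data with KMeans
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    labels = kmeans.labels_
    # Cluster centers
    centers = kmeans.cluster_centers_
    # Distance from each sample to the center of its assigned cluster
    distances = np.linalg.norm(X - centers[labels], axis=1)
    # Mean distance within each cluster
    avg_distances = np.zeros(n_clusters)
    for i in range(n_clusters):
        members = distances[labels == i]
        if members.size > 0:
            avg_distances[i] = members.mean()
    # Outlier score: a sample's distance divided by the mean distance of its
    # cluster, so samples far from their center (relative to the cluster's
    # spread) score high
    scores = distances / (avg_distances[labels] + 1e-8)
    # Flag the top `contamination` fraction of scores as outliers
    threshold = np.quantile(scores, 1 - contamination)
    outliers = X[scores > threshold]
    return outliers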
```
Usage:
```python
# Generate some test data (uses np and detect_outliers from the block above)
X = np.random.randn(1000, 2)
X[:50] += 5
X[50:100] += np.array([5, -5])
X[100:150] += np.array([-5, 5])
X[150:200] += np.array([5, 5])
X[200:250] += np.array([-5, -5])
# Detect outliers with the clustering-based method
outliers = detect_outliers(X, n_clusters=8, contamination=0.05)
print("Number of outliers:", len(outliers))
```
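The usage above only prints a count, which does not show whether the flagged points are the right ones. One way to sanity-check the idea is to plant a few points that are known to lie far from every cluster and confirm they are flagged. The sketch below is self-contained and makes several assumptions that are not part of the original answer: the helper name `outlier_mask`, the choice to return a boolean mask instead of coordinates, the relative-distance score (distance to the assigned center divided by that cluster's mean distance), and the synthetic data itself.

```python
import numpy as np
from sklearn.cluster import KMeans


def outlier_mask(X, n_clusters=3, contamination=0.05, random_state=0):
    """Hypothetical helper: boolean mask of samples flagged as outliers."""
    X = np.asarray(X, dtype=float)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit(X)
    labels = kmeans.labels_
    # Distance from each sample to the center of its assigned cluster
    distances = np.linalg.norm(X - kmeans.cluster_centers_[labels], axis=1)
    # Mean distance per cluster, guarding against empty clusters
    sums = np.bincount(labels, weights=distances, minlength=n_clusters)
    counts = np.bincount(labels, minlength=n_clusters)
    avg = sums / np.maximum(counts, 1)
    # Relative-distance score: large when a sample is far from its center
    # compared with the typical spread of its cluster
    scores = distances / (avg[labels] + 1e-8)
    threshold = np.quantile(scores, 1 - contamination)
    return scores > threshold


# Three tight synthetic clusters (assumed data, for illustration only)
rng = np.random.default_rng(0)
cluster_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
inliers = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                     for c in cluster_centers])
# Six planted points placed well away from every cluster
planted = np.array([[5, 5], [4, -4], [-4, 4], [14, 4], [4, 14], [-5, -5]],
                   dtype=float)
X_demo = np.vstack([inliers, planted])

mask = outlier_mask(X_demo, n_clusters=3, contamination=0.05)
print("Flagged samples:", int(mask.sum()))
print("Planted points flagged:", int(mask[-len(planted):].sum()), "of", len(planted))
```

Returning a mask keeps the row positions, so the flagged samples can be looked up in the original array with `X_demo[mask]` or traced back to their source records with `np.flatnonzero(mask)`. Note that a quantile threshold always flags roughly a `contamination` fraction of the data, whether or not that many true outliers exist.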