Python implementation of clustering-based and classification-based outlier detection on a dataset stored in a CSV file
Date: 2024-05-10 11:16:31
Clustering-based outlier detection:
```python
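import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the data (assumes every column is numeric and has no missing values)
data = pd.read_csv('data.csv')
# Standardize the features
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
# Reduce to 2 dimensions with PCA (used only for plotting)
pca = PCA(n_components=2)
data_pca = pca.fit_transform(data_scaled)
# Fit the KMeans model (fixed seed so the result is reproducible)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(data_scaled)
# Distance from every point to every cluster center, shape (n_samples, n_clusters)
distances = kmeans.transform(data_scaled)
# Distance from each point to its nearest (i.e. assigned) cluster center
min_distances = np.min(distances, axis=1)
# Flag the 5% of points farthest from their cluster center as outliers
outliers = (min_distances > np.percentile(min_distances, 95))
# Plot the clusters and mark the outliers with red crosses
plt.scatter(data_pca[:, 0], data_pca[:, 1], c=kmeans.labels_)
plt.scatter(data_pca[outliers, 0], data_pca[outliers, 1], c='red', marker='x')
plt.show()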
```
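The distance-to-centroid rule used above can also be wrapped in a small function and tried on data where the answer is known. The following is only a minimal sketch: the function name `kmeans_outliers`, its parameters, and the synthetic blobs are assumptions made for illustration and are not part of the original script.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def kmeans_outliers(X, n_clusters=3, percentile=95, random_state=0):
    """Flag points whose distance to the nearest centroid exceeds a percentile cutoff.

    Hypothetical helper: the name and signature are illustrative only.
    """
    X_scaled = StandardScaler().fit_transform(X)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X_scaled)
    # transform() returns the distance to every centroid; the row minimum is the
    # distance to the centroid of the cluster the point was assigned to.
    dist = km.transform(X_scaled).min(axis=1)
    threshold = np.percentile(dist, percentile)
    return dist > threshold, dist, threshold


# Assumed synthetic data: three tight blobs plus one planted far-away point (last row).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
blobs = [rng.normal(c, 0.5, size=(100, 2)) for c in centers]
X = np.vstack(blobs + [np.array([[30.0, 30.0]])])

mask, dist, threshold = kmeans_outliers(X)
print(mask.sum(), "points flagged; planted point flagged:", bool(mask[-1]))
```

Note that a percentile cutoff always flags roughly the same fraction of points (about 5% here), whether or not the data actually contains that many outliers, so the percentile is a tuning choice rather than something the data determines.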
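Classification-based outlier detection (here Isolation Forest, an unsupervised model that labels each point as inlier or outlier):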
```python
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the data (assumes every column is numeric and has no missing values)
data = pd.read_csv('data.csv')
# Standardize the features
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
# Reduce to 2 dimensions with PCA (used only for plotting)
pca = PCA(n_components=2)
data_pca = pca.fit_transform(data_scaled)
# Fit the IsolationForest model; contamination is the expected fraction of outliers
iforest = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
iforest.fit(data_scaled)
# predict() returns -1 for outliers and +1 for inliers
outliers = iforest.predict(data_scaled) == -1
# Plot inliers in blue and outliers in red
plt.scatter(data_pca[:, 0], data_pca[:, 1], c=np.where(outliers, 'red', 'blue'))
plt.show()
```
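Besides the binary labels from `predict`, `IsolationForest` also exposes a continuous score through `decision_function`, which can be used to rank points by how anomalous they look. The sketch below is illustrative only and runs on assumed synthetic data (a Gaussian cloud with two planted points), not on `data.csv`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Assumed synthetic data: 200 standard-normal points plus two planted outliers (last two rows).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               np.array([[8.0, 8.0], [-9.0, 7.0]])])

iforest = IsolationForest(n_estimators=100, contamination=0.05, random_state=0).fit(X)

labels = iforest.predict(X)            # +1 = inlier, -1 = outlier
scores = iforest.decision_function(X)  # lower = more anomalous; negative = outlier

# Rank rows from most to least anomalous and show the top five.
ranked = np.argsort(scores)
print("most anomalous row indices:", ranked[:5])
print("number flagged:", int((labels == -1).sum()))
```

Ranking by score is useful when the `contamination` value is only a guess: the top of the ranking can be inspected by hand instead of trusting a fixed cutoff.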
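In both scripts, `data.csv` is the dataset file; replace it with your own file as needed. The code above is for reference only: values such as `n_clusters=3`, the 95th-percentile cutoff, and `contamination=0.05` should be adjusted to fit your data.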
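One adjustment that is often needed is preparing the CSV before scaling, since `StandardScaler` cannot handle text columns or missing values. A minimal sketch of one possible approach follows; the helper `load_numeric_features`, the column names, and the choice of median imputation are all assumptions for illustration, not something the original answer prescribes.

```python
import io

import pandas as pd


def load_numeric_features(source, drop=()):
    """Read a CSV, keep numeric columns only, and fill gaps with column medians.

    Hypothetical helper: `drop` lists columns (such as row IDs) to exclude.
    """
    df = pd.read_csv(source).drop(columns=list(drop))
    numeric = df.select_dtypes(include="number")
    # Median imputation is one simple option; dropping incomplete rows is another.
    return numeric.fillna(numeric.median())


# Assumed example file contents, passed in-memory instead of a path on disk.
csv_text = (
    "id,name,height,weight\n"
    "1,a,170,60\n"
    "2,b,,72\n"
    "3,c,180,\n"
    "4,d,165,55\n"
)
features = load_numeric_features(io.StringIO(csv_text), drop=["id"])
print(features)
```

The returned DataFrame can then be passed to `scaler.fit_transform` in place of `data` in either script.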