1. Perform a clustering analysis of the 2-D iris data with the DBSCAN algorithm (two implementations: calling a library and an autoencoder-based approach). 2. Evaluate the clustering performance with internal and external metrics, and compare the results with K-means.
1. DBSCAN clustering with the scikit-learn library
First, import the required libraries and the iris dataset:
```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
iris = load_iris()
X = iris.data[:, :2]  # keep only the first two features so the result is easy to visualize
y = iris.target
```
Then, run the clustering with the `DBSCAN` class:
```python
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X)
```
Here `eps` is the radius of the neighborhood, and `min_samples` is the minimum number of points (including the point itself) that must fall inside that neighborhood for a point to be counted as a core point.
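Choosing `eps` by eye is error-prone. A common heuristic, shown here only as an optional sketch rather than part of the original code, is to plot the sorted distance to the `min_samples`-th nearest neighbor and read `eps` off the "elbow" of that curve:
```python
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt

# For min_samples=5, a core point needs 5 points (itself + 4 others) within eps,
# so look at the distance to the 4th nearest other point.
nn = NearestNeighbors(n_neighbors=5).fit(X)  # kneighbors(X) includes the point itself
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

plt.plot(k_dist)
plt.xlabel('points sorted by 5-NN distance')
plt.ylabel('5-NN distance')
plt.show()  # pick eps near the elbow of this curve
```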
Finally, the clustering result can be visualized (DBSCAN assigns the label -1 to noise points, which show up as their own color):
```python
plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_)
plt.show()
```
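DBSCAN puts points in low-density regions into a noise "cluster" with label -1, so before reading too much into the scatter plot it is worth checking how many clusters and noise points were actually found. A minimal sketch:
```python
import numpy as np

labels = dbscan.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 is the noise label
n_noise = int(np.sum(labels == -1))
print('clusters found:', n_clusters, ' noise points:', n_noise)
```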
The complete code is as follows:
```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
iris = load_iris()
X = iris.data[:, :2]  # keep only the first two features so the result is easy to visualize
y = iris.target
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X)
plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_)
plt.show()
```
Running this produces a clustering result figure like the one below:
![dbscan_iris](https://img-blog.csdnimg.cn/20210720191802658.png)
2. DBSCAN clustering with an autoencoder
First, standardize the iris features:
```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```
Then, build the autoencoder model:
```python
from keras.layers import Input, Dense
from keras.models import Model
input_dim = X_scaled.shape[1]
input_layer = Input(shape=(input_dim,))
encoded = Dense(4, activation='relu')(input_layer)
decoded = Dense(input_dim, activation='linear')(encoded)
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='mean_squared_error')
```
The autoencoder maps the two standardized features into a 4-unit hidden representation. (With only two input features this is an expansion rather than a reduction of dimensionality; the point is simply to give DBSCAN a learned, nonlinear feature space to cluster in.) Next, train the autoencoder:
```python
autoencoder.fit(X_scaled, X_scaled, epochs=50)
```
After training, we take the output of the encoder (the hidden layer) as the new feature representation and cluster it with `DBSCAN`:
```python
encoder = Model(input_layer, encoded)
X_encoded = encoder.predict(X_scaled)
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X_encoded)
```
Finally, visualize the clustering result. Note that the points are plotted in the original two-feature space but colored by the labels found in the encoded space:
```python
plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_)
plt.show()
```
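Because the clusters were found in the 4-dimensional encoded space, separations that DBSCAN sees there may not be visible in the original two-feature plot. As a quick sanity check (a sketch, not part of the original code), the first two encoded dimensions can be plotted instead:
```python
# view the clustering in the space DBSCAN actually worked in (first two encoded dimensions)
plt.scatter(X_encoded[:, 0], X_encoded[:, 1], c=dbscan.labels_)
plt.xlabel('encoded dim 1')
plt.ylabel('encoded dim 2')
plt.show()
```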
The complete code is as follows:
```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from keras.layers import Input, Dense
from keras.models import Model
iris = load_iris()
X = iris.data[:, :2]  # keep only the first two features so the result is easy to visualize
y = iris.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
input_dim = X_scaled.shape[1]
input_layer = Input(shape=(input_dim,))
encoded = Dense(4, activation='relu')(input_layer)
decoded = Dense(input_dim, activation='linear')(encoded)
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='mean_squared_error')
autoencoder.fit(X_scaled, X_scaled, epochs=50)
encoder = Model(input_layer, encoded)
X_encoded = encoder.predict(X_scaled)
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X_encoded)
plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_)
plt.show()
```
Running this produces a clustering result figure like the one below:
![dbscan_iris_autoencoder](https://img-blog.csdnimg.cn/20210720193507379.png)
3. Clustering performance evaluation
Clustering algorithms are usually evaluated with internal metrics and external metrics.
Internal metrics judge the clustering structure itself: samples within the same cluster should be close together, while samples in different clusters should be far apart. Common internal metrics include the silhouette coefficient and the Davies-Bouldin index.
External metrics compare the clustering result against the ground-truth labels. Commonly used external metrics include precision, recall, and the F1 score.
Here we use the silhouette coefficient as the internal metric, and accuracy, precision, recall, and F1 as the external metrics.
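One caveat before computing the external metrics: accuracy, precision, recall, and F1 assume that the cluster ids already correspond to the true class ids, which is not the case for raw DBSCAN or K-means output. Permutation-invariant external metrics such as the adjusted Rand index (ARI) and normalized mutual information (NMI) sidestep this problem; a minimal sketch (assuming `dbscan` has been fitted as above):
```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

print('DBSCAN ARI:', adjusted_rand_score(y, dbscan.labels_))
print('DBSCAN NMI:', normalized_mutual_info_score(y, dbscan.labels_))
# the same two calls apply to kmeans.labels_ once K-means has been fitted below
```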
To compute the metrics chosen above, first import the relevant functions:
```python
from sklearn.metrics import silhouette_score, accuracy_score, precision_score, recall_score, f1_score
```
Then, compute the silhouette coefficient (in the complete script below the labels come from DBSCAN fitted on the encoded features, while the silhouette is measured in the original two-feature space):
```python
silhouette = silhouette_score(X, dbscan.labels_)
print('Silhouette coefficient:', silhouette)
```
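Note that `silhouette_score` raises an error when there are fewer than two distinct labels, which can happen if DBSCAN puts everything into a single cluster or marks everything as noise. A defensive sketch:
```python
import numpy as np

labels = dbscan.labels_
if len(np.unique(labels)) > 1:
    print('Silhouette coefficient:', silhouette_score(X, labels))
else:
    print('Silhouette coefficient undefined: only one cluster (or only noise) was found')
```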
Next, compute the accuracy, recall, and F1 score of the clustering. These functions compare raw cluster ids against the class labels, so the numbers are only meaningful once the cluster labels have been matched to the true classes; a label-mapping sketch follows after the code below:
```python
accuracy = accuracy_score(y, dbscan.labels_)
precision = precision_score(y, dbscan.labels_, average='weighted')
recall = recall_score(y, dbscan.labels_, average='weighted')
f1 = f1_score(y, dbscan.labels_, average='weighted')
print('Accuracy:', accuracy)
print('Recall:', recall)
print('F1 score:', f1)
```
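On their own, these scores mostly measure label permutation. One simple fix (a hypothetical helper, not part of the original code) is to map each cluster to the majority true class among its members before scoring; DBSCAN noise points stay at -1 and therefore count as errors:
```python
import numpy as np
from sklearn.metrics import accuracy_score

def map_cluster_labels(cluster_labels, true_labels):
    """Relabel each cluster with the majority true class among its members."""
    mapped = np.full_like(cluster_labels, -1)
    for c in np.unique(cluster_labels):
        if c == -1:
            continue  # keep DBSCAN noise points as -1 (they count as misclassified)
        mask = cluster_labels == c
        mapped[mask] = np.bincount(true_labels[mask]).argmax()
    return mapped

mapped = map_cluster_labels(dbscan.labels_, y)
print('Accuracy after label mapping:', accuracy_score(y, mapped))
```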
Finally, compare the performance metrics of DBSCAN and K-means:
```python
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
k_silhouette = silhouette_score(X, kmeans.labels_)
k_accuracy = accuracy_score(y, kmeans.labels_)
k_precision = precision_score(y, kmeans.labels_, average='weighted')
k_recall = recall_score(y, kmeans.labels_, average='weighted')
k_f1 = f1_score(y, kmeans.labels_, average='weighted')
print('DBSCAN clustering results:')
print('Silhouette coefficient:', silhouette)
print('Accuracy:', accuracy)
print('Recall:', recall)
print('F1 score:', f1)
print('KMeans clustering results:')
print('Silhouette coefficient:', k_silhouette)
print('Accuracy:', k_accuracy)
print('Recall:', k_recall)
print('F1 score:', k_f1)
```
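A small practical note: `KMeans` uses random initialization, so the numbers below can change from run to run. If reproducibility matters, the seed and the number of restarts can be fixed, for example:
```python
# same comparison, but with a fixed seed and explicit number of restarts
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)
```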
The complete code is as follows:
```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, accuracy_score, precision_score, recall_score, f1_score
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from keras.layers import Input, Dense
from keras.models import Model
iris = load_iris()
X = iris.data[:, :2]  # keep only the first two features so the result is easy to visualize
y = iris.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
input_dim = X_scaled.shape[1]
input_layer = Input(shape=(input_dim,))
encoded = Dense(4, activation='relu')(input_layer)
decoded = Dense(input_dim, activation='linear')(encoded)
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='mean_squared_error')
autoencoder.fit(X_scaled, X_scaled, epochs=50)
encoder = Model(input_layer, encoded)
X_encoded = encoder.predict(X_scaled)
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X_encoded)
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
silhouette = silhouette_score(X, dbscan.labels_)
accuracy = accuracy_score(y, dbscan.labels_)
precision = precision_score(y, dbscan.labels_, average='weighted')
recall = recall_score(y, dbscan.labels_, average='weighted')
f1 = f1_score(y, dbscan.labels_, average='weighted')
k_silhouette = silhouette_score(X, kmeans.labels_)
k_accuracy = accuracy_score(y, kmeans.labels_)
k_precision = precision_score(y, kmeans.labels_, average='weighted')
k_recall = recall_score(y, kmeans.labels_, average='weighted')
k_f1 = f1_score(y, kmeans.labels_, average='weighted')
print('DBSCAN clustering results:')
print('Silhouette coefficient:', silhouette)
print('Accuracy:', accuracy)
print('Recall:', recall)
print('F1 score:', f1)
print('KMeans clustering results:')
print('Silhouette coefficient:', k_silhouette)
print('Accuracy:', k_accuracy)
print('Recall:', k_recall)
print('F1 score:', k_f1)
```
Running the script prints performance numbers along the following lines (the exact values vary from run to run, since the autoencoder and K-means are randomly initialized):
```
DBSCAN clustering results:
Silhouette coefficient: 0.4599482392051861
Accuracy: 0.3333333333333333
Recall: 0.3333333333333333
F1 score: 0.14035087719298245
KMeans clustering results:
Silhouette coefficient: 0.4450525698649191
Accuracy: 0.24
Recall: 0.24
F1 score: 0.2361111111111111
```
The two algorithms reach similar silhouette coefficients. The external metrics are harder to interpret: because the raw cluster ids are not aligned with the true class labels (and DBSCAN additionally marks some points as noise), the accuracy, recall, and F1 values above mostly reflect label permutation rather than clustering quality, so a permutation-invariant metric such as the adjusted Rand index, or accuracy after label mapping as sketched earlier, gives a fairer comparison. More generally, DBSCAN is designed for clusters of irregular shape and varying density, whereas the iris data forms fairly compact, roughly convex clusters of similar density, a setting in which K-means tends to do well.