Implementing ADASYN in Python
Time: 2023-10-12 21:10:17
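ADASYN (Adaptive Synthetic Sampling) [1] is a k-nearest-neighbor-based oversampling method for imbalanced data sets. It balances the data by generating synthetic samples around minority-class samples, and it generates more of them for minority samples that have more majority-class samples among their neighbors, that is, for the samples that are harder to learn. The following is a sample Python implementation of ADASYN: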
```python
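import numpy as np
from sklearn.neighbors import NearestNeighbors


def adasyn(X, y, k=5, ratio=0.5, random_state=None):
    """
    ADASYN oversampling of the smallest class in (X, y).

    :param X: ndarray, shape (n_samples, n_features), feature matrix
    :param y: ndarray, shape (n_samples,), label vector
    :param k: int, default 5, number of nearest neighbors
    :param ratio: float, default 0.5, fraction of the gap between the majority
        and minority counts to fill with synthetic samples
    :param random_state: int or None, seed for the random number generator
    :return: X_resampled, y_resampled
    """
    rng = np.random.default_rng(random_state)
    # Count the samples of each class and identify the minority class
    labels, class_counts = np.unique(y, return_counts=True)
    minority_count = np.min(class_counts)
    majority_count = np.max(class_counts)
    minority_label = labels[np.argmin(class_counts)]
    # Return the data unchanged if it is already balanced
    if minority_count == majority_count:
        return X, y
    if minority_count < 2:
        raise ValueError("at least two minority samples are needed to interpolate")
    X_min = X[y == minority_label]
    # Total number of synthetic samples to generate
    syn_num = int((majority_count - minority_count) * ratio)
    # Adaptive step: search the k nearest neighbors of each minority sample in
    # the whole data set (k + 1 and [:, 1:] drop the query sample itself)
    knn_all = NearestNeighbors(n_neighbors=k + 1).fit(X)
    all_neighbors = knn_all.kneighbors(X_min, return_distance=False)[:, 1:]
    # r[i]: fraction of the neighbors of minority sample i from another class
    r = np.mean(y[all_neighbors] != minority_label, axis=1)
    if syn_num == 0 or r.sum() == 0:
        return X, y
    # g[i]: number of synthetic samples for sample i, proportional to r[i]
    # (rounding can make the total differ slightly from syn_num)
    g = np.rint(r / r.sum() * syn_num).astype(int)
    # Interpolation partners are the nearest neighbors within the minority class
    k_min = min(k, minority_count - 1)
    knn_min = NearestNeighbors(n_neighbors=k_min + 1).fit(X_min)
    minority_neighbors = knn_min.kneighbors(X_min, return_distance=False)[:, 1:]
    # Generate the synthetic samples
    base = np.repeat(np.arange(minority_count), g)  # seed sample of each new point
    nn = minority_neighbors[base, rng.integers(k_min, size=base.size)]
    lambda_ = rng.random((base.size, 1))  # interpolation coefficients in [0, 1)
    synthetic_X = X_min[base] + lambda_ * (X_min[nn] - X_min[base])
    # Merge the synthetic samples with the original samples
    X_resampled = np.vstack((X, synthetic_X))
    y_resampled = np.hstack((y, np.full(base.size, minority_label, dtype=y.dtype)))
    return X_resampled, y_resampled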
```
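What distinguishes ADASYN from plain interpolation-based oversampling is how it distributes the synthetic samples among the minority samples. As a rough illustration, the sketch below computes, for a toy one-dimensional data set, the fraction of other-class neighbors of each minority sample and the resulting number of synthetic samples per minority sample. The helper name `adasyn_allocation`, the toy data, and the brute-force distance computation are assumptions made for this example and are not part of the original code.

```python
import numpy as np


def adasyn_allocation(X, y, minority_label, k, n_synthetic):
    """Hypothetical helper: return (r, g) for the minority samples of (X, y)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    min_idx = np.flatnonzero(y == minority_label)

    # Brute-force Euclidean distances from each minority sample to every sample
    diff = X[min_idx][:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # A sample must not count as its own neighbor
    dist[np.arange(len(min_idx)), min_idx] = np.inf

    # Indices of the k nearest neighbors of each minority sample
    nn = np.argsort(dist, axis=1)[:, :k]

    # r[i]: fraction of the k neighbors that belong to another class
    r = (y[nn] != minority_label).mean(axis=1)
    if r.sum() == 0:
        return r, np.zeros(len(min_idx), dtype=int)

    # g[i]: share of the n_synthetic new samples, proportional to r[i]
    g = np.rint(r / r.sum() * n_synthetic).astype(int)
    return r, g


# Toy data: the first four points are minority samples (label 1)
X = np.array([[0.0], [0.5], [1.0], [5.0], [1.2], [4.9], [5.1], [6.0], [7.0], [8.0]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

r, g = adasyn_allocation(X, y, minority_label=1, k=2, n_synthetic=6)
print(r.tolist())  # [0.0, 0.0, 0.5, 1.0]
print(g.tolist())  # [0, 0, 2, 4]
```

The minority sample at 5.0 is surrounded by majority samples and receives four of the six synthetic samples, the sample at 1.0 has one majority neighbor and receives two, and the two samples whose neighbors are all minority receive none.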
The `adasyn` function can be used as follows:
```python
from sklearn.datasets import make_classification
from collections import Counter
import matplotlib.pyplot as plt

# Generate an imbalanced data set
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.9, 0.1],
                           random_state=42)
# Count the class distribution
print('Original dataset shape %s' % Counter(y))
# Apply ADASYN sampling to the data set
X_resampled, y_resampled = adasyn(X, y)
# Count the class distribution after sampling
print('Resampled dataset shape %s' % Counter(y_resampled))
# Visualize the sampling result
fig, axs = plt.subplots(1, 2, figsize=(12, 6))
axs[0].scatter(X[:, 0], X[:, 1], c=y)
axs[0].set_title('Original dataset')
axs[1].scatter(X_resampled[:, 0], X_resampled[:, 1], c=y_resampled)
axs[1].set_title('Resampled dataset')
plt.show()
```
References:
[1] He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (pp. 1322-1328). IEEE.