Undersampling code
Date: 2023-07-03 12:23:44
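Undersampling addresses class imbalance by reducing the number of samples in some classes of a dataset, typically the majority class. Below is a simple undersampling implementation for binary labels: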
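```python
import numpy as np
from collections import Counter


def undersample(X, y, ratio=1.0):
    """
    X: feature matrix, shape (n_samples, n_features)
    y: labels, shape (n_samples,), with 0 = negative and 1 = positive
    ratio: number of negative samples to keep, as a multiple of the number
        of positive samples; the default 1.0 keeps equal numbers of both
    """
    y = np.asarray(y)
    # Count the samples in each class
    counter = Counter(y)
    n_neg = counter[0]  # number of negative samples
    n_pos = counter[1]  # number of positive samples
    # Number of negative samples to keep after undersampling
    n_neg_undersample = int(n_pos * ratio)
    if n_neg_undersample > n_neg:
        raise ValueError("Undersample ratio is too large.")
    # Randomly draw n_neg_undersample negative samples without replacement
    neg_indices = np.where(y == 0)[0]
    neg_indices_undersample = np.random.choice(neg_indices, size=n_neg_undersample, replace=False)
    # Combine the indices of the kept negative samples and all positive samples
    indices_undersample = np.concatenate([neg_indices_undersample, np.where(y == 1)[0]])
    # Return the undersampled data
    return X[indices_undersample], y[indices_undersample]
```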
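The function above draws from NumPy's global random state, so the selected negative samples change from run to run unless the global seed is set. The following is a minimal sketch of a reproducible variant, assuming the same 0/1 label convention. The name `undersample_seeded`, the `seed` parameter, and the toy arrays are illustrative and not part of the original code.

```python
import numpy as np


def undersample_seeded(X, y, ratio=1.0, seed=None):
    """Hypothetical variant: keep all class-1 rows and
    int(n_pos * ratio) randomly chosen class-0 rows."""
    X = np.asarray(X)
    y = np.asarray(y)
    # Local generator: the global NumPy random state is left untouched
    rng = np.random.default_rng(seed)
    neg_indices = np.flatnonzero(y == 0)
    pos_indices = np.flatnonzero(y == 1)
    n_keep = int(len(pos_indices) * ratio)
    if n_keep > len(neg_indices):
        raise ValueError("ratio is too large for the available negative samples")
    kept_neg = rng.choice(neg_indices, size=n_keep, replace=False)
    # Shuffle so the output is not ordered negatives first, positives last
    indices = rng.permutation(np.concatenate([kept_neg, pos_indices]))
    return X[indices], y[indices]


# Toy data (assumed for illustration): 8 negatives, 2 positives
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X_a, y_a = undersample_seeded(X_demo, y_demo, ratio=1.0, seed=0)
X_b, y_b = undersample_seeded(X_demo, y_demo, ratio=1.0, seed=0)
print(X_a.shape, np.array_equal(X_a, X_b))
```

Passing the same seed twice selects the same rows, which makes experiments easier to compare.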
Usage example:
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Generate an imbalanced binary classification dataset (about 90% class 0, 10% class 1)
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, weights=[0.9, 0.1], random_state=42)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Undersample the training set only; the test set keeps its original distribution
X_train_undersample, y_train_undersample = undersample(X_train, y_train, ratio=0.5)
# Train a logistic regression model
clf = LogisticRegression(random_state=42)
clf.fit(X_train_undersample, y_train_undersample)
# Evaluate the model on the test set
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)
```
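The usage example reports accuracy only, and on an imbalanced test set accuracy can look high even when the minority class is never predicted. As a rough sketch, per-class recall can be computed with plain NumPy. The helper `per_class_recall` and the demo arrays below are hypothetical and are not the output of the model above.

```python
import numpy as np


def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = {}
    for label in np.unique(y_true):
        mask = y_true == label
        recalls[int(label)] = float(np.mean(y_pred[mask] == label))
    return recalls


# Hypothetical case: 9 negatives, 1 positive, classifier always predicts 0
y_true_demo = np.array([0] * 9 + [1])
y_pred_demo = np.zeros(10, dtype=int)
accuracy_demo = float(np.mean(y_true_demo == y_pred_demo))
recall_demo = per_class_recall(y_true_demo, y_pred_demo)
print(accuracy_demo, recall_demo)
```

In this made-up case the accuracy is 0.9 while the recall for class 1 is 0.0, which is why a per-class view is worth checking alongside accuracy.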
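After undersampling with ratio=0.5, the number of negative samples is 50% of the number of positive samples; adjust the ratio to suit your data. Note that undersampling discards data (here, most of the negative samples), so it should be used with care.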
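To get a feel for how much data a given ratio throws away, the arithmetic used by the function can be sketched on its own. The helper `undersample_counts` and the class sizes 900 and 100 are assumptions for illustration, not the exact sizes of the training split above.

```python
def undersample_counts(n_neg, n_pos, ratio=1.0):
    """Sample counts after keeping int(n_pos * ratio) negatives and all positives."""
    n_neg_kept = int(n_pos * ratio)
    if n_neg_kept > n_neg:
        raise ValueError("ratio is too large for the available negative samples")
    return {
        "neg_kept": n_neg_kept,
        "neg_discarded": n_neg - n_neg_kept,
        "total_kept": n_neg_kept + n_pos,
    }


# Assumed class sizes: 900 negatives and 100 positives
counts = undersample_counts(n_neg=900, n_pos=100, ratio=0.5)
print(counts)
```

With these assumed sizes, 50 negatives are kept and 850 are discarded, leaving 150 training samples in total.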