Downsampling code for sequences with imbalanced classes
Date: 2024-01-26 08:49:35
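Downsampling is a common way to deal with class imbalance. Below is a simple Python example that resamples a class-imbalanced sequence so that every class ends up with the same number of samples: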
```python
import random
from collections import Counter


def downsample(X, y, target_size):
    """
    X: input sequence
    y: corresponding labels
    target_size: number of samples to keep for each class
    """
    class_counts = Counter(y)
    new_X = []
    new_y = []
    for label, count in class_counts.items():
        indices = [i for i, x in enumerate(y) if x == label]
        if count >= target_size:
            # The class has at least target_size samples:
            # keep only its first target_size samples.
            new_X.extend(X[i] for i in indices[:target_size])
            new_y.extend([label] * target_size)
        else:
            # The class has fewer than target_size samples: keep them all,
            # then pad up to target_size by redrawing at random with
            # replacement (strictly speaking, this is oversampling).
            new_X.extend(X[i] for i in indices)
            new_y.extend([label] * count)
            for _ in range(target_size - count):
                new_X.append(X[random.choice(indices)])
                new_y.append(label)
    return new_X, new_y
```
Usage example:
```python
X = ["hello", "world", "how", "are", "you", "doing", "today", "I", "am", "fine"]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
new_X, new_y = downsample(X, y, 4)
print(new_X)  # ['hello', 'world', 'how', 'are', 'today', 'I', 'am', 'fine']
print(new_y)  # [0, 0, 0, 0, 1, 1, 1, 1]
```
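In the example above, the target size is set to 4. Class 0 has 6 samples, which is more than 4, so only its first 4 samples are kept. Class 1 has exactly 4 samples, so all of them are kept. The result is a new sequence of 8 samples, 4 per class. A class with fewer than `target_size` samples would go through the `else` branch instead, which keeps every sample and is meant to pad the class up to `target_size` by redrawing samples at random; that step is oversampling rather than downsampling.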
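Keeping the first `target_size` samples of a class can bias the result when the sequence is ordered, so a common variant draws the kept samples at random instead. The following is a minimal sketch of that idea, not part of the original code: the function name `random_downsample`, the `seed` parameter, and the choice to default `target_size` to the size of the smallest class are all assumptions made for illustration. It only ever removes samples, so no class is padded.

```python
import random
from collections import defaultdict


def random_downsample(X, y, target_size=None, seed=None):
    """Randomly keep at most target_size samples per class (hypothetical helper)."""
    rng = random.Random(seed)  # local generator, so a fixed seed gives repeatable results
    by_label = defaultdict(list)
    for i, label in enumerate(y):
        by_label[label].append(i)
    if not by_label:
        return [], []
    if target_size is None:
        # Assumption: default to the size of the smallest class
        target_size = min(len(indices) for indices in by_label.values())
    kept = []
    for indices in by_label.values():
        k = min(target_size, len(indices))
        kept.extend(rng.sample(indices, k))  # sampling without replacement
    kept.sort()  # preserve the original order of the sequence
    return [X[i] for i in kept], [y[i] for i in kept]


X = ["hello", "world", "how", "are", "you", "doing", "today", "I", "am", "fine"]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
new_X, new_y = random_downsample(X, y, seed=0)
print(new_X)
print(new_y)
```

With these inputs the smallest class has 4 samples, so the result contains 4 samples of each class; which class 0 samples survive depends on the seed.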