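I have a labeled dataset with four classes, one of which has far too many samples. I need complete random undersampling code that runs as-is and shows the results before and after processing.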


Posted: 2024-05-16 18:14:41 | Views: 77

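Sure, here is a simple random undersampling function. It assumes the label column of your dataset is named `label`; pass the column name in as a parameter. Every class that is larger than the smallest class is randomly reduced to the size of the smallest class:

```python
import pandas as pd
from sklearn.utils import resample


def random_undersample(df, label_col, random_state=42):
    # Count the samples in each class
    class_counts = df[label_col].value_counts()
    # Size of the smallest class; larger classes are reduced to this size
    min_class_count = class_counts.min()

    parts = []
    for label, count in class_counts.items():
        df_class = df[df[label_col] == label]
        # Only classes larger than the smallest one need undersampling
        if count > min_class_count:
            df_class = resample(df_class, replace=False,
                                n_samples=min_class_count,
                                random_state=random_state)
        parts.append(df_class)
    return pd.concat(parts)
```

Usage:

```python
# Read the dataset; 'label' is the name of the label column
df = pd.read_csv('your_dataset.csv')
undersampled_df = random_undersample(df, 'label')

# Print the shape and per-class counts before and after undersampling
print('Original dataset shape:', df.shape)
print(df['label'].value_counts())
print('Undersampled dataset shape:', undersampled_df.shape)
print(undersampled_df['label'].value_counts())
```

Here `resample` is a general-purpose resampling utility from `sklearn.utils`. `replace=False` means sampling without replacement, `n_samples` is the number of rows to draw, and `random_state` fixes the random seed so the result is reproducible.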
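Since the question says only one of the four classes is too large, it may be preferable to shrink just that class and leave the other three untouched. The following is a minimal sketch of that variant, not part of the original answer. The function name `undersample_largest_class`, the choice of the second-largest class size as the target, and the synthetic demo data are all assumptions made for illustration.

```python
import pandas as pd


def undersample_largest_class(df, label_col, random_state=0):
    """Shrink only the largest class to the size of the second-largest one.

    Hypothetical helper: the target size is an assumption, adjust as needed.
    """
    counts = df[label_col].value_counts()  # sorted in descending order
    if len(counts) < 2:
        return df.copy()
    majority_label = counts.index[0]
    target_size = int(counts.iloc[1])

    majority = df[df[label_col] == majority_label]
    rest = df[df[label_col] != majority_label]
    # Draw target_size rows from the largest class without replacement
    majority_down = majority.sample(n=target_size, random_state=random_state)
    # Shuffle so rows of the same class are not grouped together
    combined = pd.concat([rest, majority_down])
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Synthetic stand-in data: class "a" is heavily over-represented
demo = pd.DataFrame({
    "feature": range(1000),
    "label": ["a"] * 700 + ["b"] * 120 + ["c"] * 100 + ["d"] * 80,
})
balanced = undersample_largest_class(demo, "label")

print("Before:", demo["label"].value_counts().to_dict())
print("After: ", balanced["label"].value_counts().to_dict())
```

With real data, replace `demo` with the DataFrame loaded from your CSV file. Compared with reducing every class to the smallest one, this keeps more rows from the three smaller classes, at the cost of leaving some imbalance among them.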
