Python: read a CSV file, select training and test samples in equal proportion from each category, and split them into data and labels
Posted: 2023-12-14 15:36:45
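You can read the CSV file with the pandas library and then split it into training and test sets with the `train_test_split` function from sklearn.
Suppose your CSV file looks like this (only a few rows are shown; the real file needs several rows in every category for the split to work):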
```
category,feature1,feature2,feature3,label
A,1,2,3,1
B,4,5,6,0
A,7,8,9,1
C,10,11,12,0
```
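Before splitting, it can help to check how many rows each category has, since very small categories cannot be divided proportionally. A minimal sketch, assuming the column layout above; the inline CSV text and the `min_rows` threshold are hypothetical stand-ins for your own file and requirements:

```python
import io

import pandas as pd

# Hypothetical inline data with the same columns as the sample above;
# in practice this would be pd.read_csv("your_file.csv").
csv_text = """category,feature1,feature2,feature3,label
A,1,2,3,1
B,4,5,6,0
A,7,8,9,1
C,10,11,12,0
"""
data = pd.read_csv(io.StringIO(csv_text))

# Count the rows in each category
counts = data["category"].value_counts()

# Assumed threshold: flag categories too small to split into three parts
min_rows = 3
too_small = sorted(counts[counts < min_rows].index)
print(counts.to_dict())
print("Categories with too few rows:", too_small)
```

With only the four sample rows, every category falls below the assumed threshold, so a real file would need more rows per category.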
The code is as follows:
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the CSV file
data = pd.read_csv("your_file.csv")

# Collect the distinct categories
categories = data["category"].unique()

# test_ratio: fraction of each category held out as the test set
# train_ratio: fraction of the remaining rows kept for training
# (the rest of the remaining rows becomes the validation set)
train_ratio = 0.8
test_ratio = 0.2

# Process each category separately
for category in categories:
    # Select the rows that belong to this category
    category_data = data.loc[data["category"] == category]

    # Separate the data (features) from the labels
    X = category_data[["feature1", "feature2", "feature3"]]
    y = category_data["label"]

    # Hold out test_ratio of this category as the test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_ratio)

    # Split the remaining rows again into training and validation sets
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=1 - train_ratio)

    # Report the size of each split
    print(f"Category {category}:")
    print(f"Train set: {X_train.shape}, {y_train.shape}")
    print(f"Validation set: {X_val.shape}, {y_val.shape}")
    print(f"Test set: {X_test.shape}, {y_test.shape}")
```
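The loop above only reports the per-category sizes. To end up with one overall training set and one overall test set in which every category contributes the same proportion, the per-category pieces can be concatenated. A minimal sketch, assuming the column names from the sample file; the helper `split_per_category`, the synthetic demo data, and the fixed `seed` are illustrative choices, not part of the original answer:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

FEATURES = ["feature1", "feature2", "feature3"]


def split_per_category(df, test_ratio=0.2, seed=0):
    """Split each category with the same ratio, then merge the pieces."""
    train_parts, test_parts = [], []
    for _, group in df.groupby("category"):
        # Split this category's rows on their own
        train_part, test_part = train_test_split(
            group, test_size=test_ratio, random_state=seed
        )
        train_parts.append(train_part)
        test_parts.append(test_part)
    train = pd.concat(train_parts)
    test = pd.concat(test_parts)
    # Separate the data (features) from the labels
    return train[FEATURES], test[FEATURES], train["label"], test["label"]


# Synthetic demo data: 10 rows in each of three categories
demo = pd.DataFrame({
    "category": ["A"] * 10 + ["B"] * 10 + ["C"] * 10,
    "feature1": range(30),
    "feature2": range(30, 60),
    "feature3": range(60, 90),
    "label": [1] * 10 + [0] * 20,
})
X_train, X_test, y_train, y_test = split_per_category(demo)
print(X_train.shape, X_test.shape)
```

Because each category is split independently, every category here contributes 8 rows to the training set and 2 rows to the test set.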
For a file with three rows in each of the categories A, B, and C, the output of the code above looks like this:
```
Category A:
Train set: (1, 3), (1,)
Validation set: (1, 3), (1,)
Test set: (1, 3), (1,)
Category B:
Train set: (1, 3), (1,)
Validation set: (1, 3), (1,)
Test set: (1, 3), (1,)
Category C:
Train set: (1, 3), (1,)
Validation set: (1, 3), (1,)
Test set: (1, 3), (1,)
```
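The sizes of each category's training, validation, and test sets follow from the two ratios. `train_test_split` first holds out `test_ratio` of the category's rows, rounded up, as the test set, and then holds out `1 - train_ratio` of the remaining rows, again rounded up, as the validation set. With three rows per category, that leaves exactly one row in each set. A category with fewer than three rows cannot be split this way, and `train_test_split` raises a `ValueError` for it.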
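As an alternative to looping over the categories, `train_test_split` accepts a `stratify` argument that keeps the category proportions equal in both parts in a single call. A minimal sketch on synthetic data (the demo frame and the `random_state` value are assumptions for illustration); note that it yields only a train/test pair, not the validation set used above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic demo data: category A has 20 rows, B and C have 10 each
demo = pd.DataFrame({
    "category": ["A"] * 20 + ["B"] * 10 + ["C"] * 10,
    "feature1": range(40),
    "feature2": range(40, 80),
    "feature3": range(80, 120),
    "label": [1] * 20 + [0] * 20,
})

# Separate the data (features) from the labels
X = demo[["feature1", "feature2", "feature3"]]
y = demo["label"]

# stratify=demo["category"] keeps each category's share the same
# in the training set and in the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=demo["category"], random_state=0
)

# Look up which categories ended up in each part
train_counts = demo.loc[X_train.index, "category"].value_counts()
test_counts = demo.loc[X_test.index, "category"].value_counts()
print(train_counts.to_dict())
print(test_counts.to_dict())
```

Stratifying on the category column is usually shorter than the explicit loop, while the loop gives more control when individual categories need different handling.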