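I have a dataset A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]. I set the random seed to 24 and split it 7:3 into a training set and a test set. If I then add 10 new samples to A and use the same random seed, do the samples that were in the training set stay in the training set, or are some of them reshuffled? Please explain with an example.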
Posted: 2023-09-18 21:08:12 | Views: 46
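With 20 samples and a 7:3 split, the training set has 14 samples and the test set has 6, since int(20 * 0.7) = 14. Which samples land in each set depends on the seed, and may also depend on the Python version, so no specific members are listed here. After 10 new samples are added, the dataset has 30 samples, which gives 21 training samples and 9 test samples. Reusing seed 24 reproduces the same stream of random numbers, but random.sample now maps that stream onto a population of 30 instead of 20, so the same random numbers pick different elements. Samples that were originally in the training set can therefore end up in the test set: a fixed seed only guarantees a reproducible split for an identical input. Example code: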
```python
import random

A = list(range(1, 21))
train_ratio = 0.7

# Initial 7:3 split: 14 training samples, 6 test samples
random.seed(24)
train = random.sample(A, int(len(A) * train_ratio))
test = [i for i in A if i not in train]
print('Initial train set:', train)
print('Initial test set:', test)

# Add 10 new samples (21..30), for 30 samples in total
new_samples = list(range(21, 31))
A += new_samples

# Re-split 7:3 with the same seed: 21 training samples, 9 test samples
random.seed(24)
train_new = random.sample(A, int(len(A) * train_ratio))
test_new = [i for i in A if i not in train_new]

# Check whether every original training sample is still in the new training set
print('Old train set:', train)
print('Old train set still in new train set:', set(train).issubset(set(train_new)))
print('New train set:', train_new)
print('New test set:', test_new)
```
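The exact members of each set may differ between Python versions, so the printed lists are not reproduced here. What is fixed is the shape of the output: the initial training set has 14 samples and the initial test set has 6, while the new training set has 21 samples and the new test set has 9, and the two new sets together contain every value from 1 to 30 exactly once.

The line "Old train set still in new train set" should be expected to print False. It would print True only if all 14 original training samples happened to be drawn again among the 21 new training samples, which is very unlikely for a random draw from 30 samples.

So the same random seed does not keep the original training samples in the training set once the dataset changes. Some of them are reassigned to the test set, and some of the original test samples may move into the training set.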
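If the goal is for existing samples to keep their assignment as the dataset grows, one option is to leave the original split untouched and split only the new samples, then append them. The following is a minimal sketch of that idea rather than a standard library feature: the helper name `split` and the use of a dedicated `random.Random(24)` instance are assumptions made for illustration.

```python
import random


def split(samples, train_ratio, rng):
    """Randomly split samples into (train, test) using the given RNG.

    Hypothetical helper for illustration only.
    """
    k = round(len(samples) * train_ratio)
    train = rng.sample(samples, k)
    train_set = set(train)
    test = [s for s in samples if s not in train_set]
    return train, test


rng = random.Random(24)

# Split the original 20 samples once: 14 train, 6 test
A = list(range(1, 21))
train, test = split(A, 0.7, rng)

# Split only the 10 new samples: 7 train, 3 test
new_samples = list(range(21, 31))
new_train, new_test = split(new_samples, 0.7, rng)

# Append the new assignments; the old ones are never touched
train_all = train + new_train
test_all = test + new_test

print(set(train).issubset(train_all))  # True by construction
print(len(train_all), len(test_all))   # 21 9
```

The overall ratio is still 7:3 here because both batches divide evenly. With batch sizes that do not, the combined ratio is only approximate.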