Binning interfaces for feature engineering
Date: 2023-10-08 21:05:25
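Binning is a feature engineering technique that discretizes a continuous value into a set of intervals, which can make the feature easier for a model to handle during training. In Python, both the pandas and numpy libraries provide functions for binning.
Here is a binning example using pandas: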
```python
import pandas as pd

# Create a DataFrame with two continuous features
df = pd.DataFrame({
    'age': [20, 25, 30, 35, 40, 45, 50, 55, 60, 65],
    'income': [20000, 35000, 50000, 65000, 80000,
               95000, 110000, 125000, 140000, 155000],
})

# Split age into 3 equal-width intervals
df['age_bin'] = pd.cut(df['age'], bins=3)

# Split income into 4 equal-width intervals
df['income_bin'] = pd.cut(df['income'], bins=4)

# Print the result
print(df)
```
Output (column spacing may differ slightly between pandas versions):
```
   age  income         age_bin            income_bin
0   20   20000  (19.955, 35.0]    (19865.0, 53750.0]
1   25   35000  (19.955, 35.0]    (19865.0, 53750.0]
2   30   50000  (19.955, 35.0]    (19865.0, 53750.0]
3   35   65000  (19.955, 35.0]    (53750.0, 87500.0]
4   40   80000    (35.0, 50.0]    (53750.0, 87500.0]
5   45   95000    (35.0, 50.0]   (87500.0, 121250.0]
6   50  110000    (35.0, 50.0]   (87500.0, 121250.0]
7   55  125000    (50.0, 65.0]  (121250.0, 155000.0]
8   60  140000    (50.0, 65.0]  (121250.0, 155000.0]
9   65  155000    (50.0, 65.0]  (121250.0, 155000.0]
```
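In the code above, `pd.cut()` splits the `age` and `income` features into 3 and 4 intervals respectively. When `bins` is an integer, `pd.cut()` builds equal-width intervals over the range of the data, closed on the right by default, and extends the range by 0.1% so that the minimum value falls inside the first interval. The resulting features `age_bin` and `income_bin` are added to the DataFrame as new categorical columns.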
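The introduction also mentions numpy, so here is a minimal sketch of the same equal-width binning of `age` using `np.linspace` and `np.digitize`. The variable names `ages`, `edges`, and `age_bin_idx` are made up for illustration, and the result is an integer bin index per value rather than the interval labels that `pd.cut()` returns.

```python
import numpy as np

ages = np.array([20, 25, 30, 35, 40, 45, 50, 55, 60, 65])

# 3 equal-width bins over the data range need 4 edges
edges = np.linspace(ages.min(), ages.max(), num=4)
print(edges)  # -> [20. 35. 50. 65.]

# Pass only the interior edges; right=True makes each bin right-closed,
# so a value equal to an edge (e.g. 35) falls into the lower bin
age_bin_idx = np.digitize(ages, edges[1:-1], right=True)
print(age_bin_idx)  # -> [0 0 0 0 1 1 1 2 2 2]
```

Integer indices like these can be fed directly to a model or one-hot encoded. If readable interval labels are more useful, `pd.cut()` is likely the more convenient choice.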