Writing baseline feature table code in Python
Date: 2023-10-14 22:07:22
Below is one possible example of Python code that builds a baseline feature table:
```python
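import pandas as pd

# Load the data (the lowercase column names used below are assumed to match the CSV files)
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# Feature engineering
def feature_engineering(data):
    data = data.copy()  # avoid modifying the caller's DataFrame
    # Bin age and fare. include_lowest=True keeps values equal to 0 in the first bin;
    # astype(object) lets the missing-value step below fill in 'Unknown'.
    data['age_category'] = pd.cut(
        data['age'], bins=[0, 18, 30, 45, 60, 100],
        labels=['0-18', '18-30', '30-45', '45-60', '60+'],
        include_lowest=True).astype(object)
    data['fare_category'] = pd.cut(
        data['fare'], bins=[0, 10, 30, 100, 600],
        labels=['0-10', '10-30', '30-100', '100+'],
        include_lowest=True).astype(object)
    # Family size: siblings/spouses + parents/children + the passenger
    data['family_size'] = data['sibsp'] + data['parch'] + 1
    data['is_alone'] = (data['family_size'] == 1).astype(int)
    # Title: the text between the comma and the first period in the name
    data['title'] = data['name'].str.split(',').str[1].str.split('.').str[0].str.strip()
    # First letter of the cabin code (NaN when the cabin is missing)
    data['cabin_letter'] = data['cabin'].str[0]
    return data

train_data = feature_engineering(train_data)
test_data = feature_engineering(test_data)

# Select features
selected_features = ['pclass', 'sex', 'age_category', 'fare_category', 'embarked', 'family_size', 'is_alone', 'title', 'cabin_letter']

# Fill missing values
train_data[selected_features] = train_data[selected_features].fillna('Unknown')
test_data[selected_features] = test_data[selected_features].fillna('Unknown')

# Build the feature tables with one-hot encoding
train_features = pd.get_dummies(train_data[selected_features], columns=selected_features)
test_features = pd.get_dummies(test_data[selected_features], columns=selected_features)
# Give the test table exactly the same columns as the training table
test_features = test_features.reindex(columns=train_features.columns, fill_value=0)

# Write the feature tables to disk
train_features.to_csv('train_features.csv', index=False)
test_features.to_csv('test_features.csv', index=False)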
```
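The code does the following:
1. Reads the training and test sets;
2. Performs feature engineering: creates categorical variables for age and fare, adds new variables for family size and whether the passenger travels alone, and extracts the title from the name and the first letter of the cabin number;
3. Selects the features to use;
4. Fills in missing values;
5. One-hot encodes the features;
6. Writes out the feature tables.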
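Steps 2 and 4 interact in a way that is easy to miss, so here is a minimal sketch on a made-up `ages` series (not data from the original files). It assumes the same age bins as the code above and shows two behaviours of `pd.cut`: intervals are right-closed by default, so a value equal to the lowest edge is left unbinned unless `include_lowest=True` is passed, and the result is categorical, so `'Unknown'` has to be registered as a category before it can be used as a fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages: a lowest-edge value (0), a bin edge (18),
# an ordinary value, a missing value, and a value in the last bin.
ages = pd.Series([0, 18, 25, np.nan, 70], name="age")

bins = [0, 18, 30, 45, 60, 100]
labels = ["0-18", "18-30", "30-45", "45-60", "60+"]

# Default behaviour: intervals are (0, 18], (18, 30], ...
# so the value 0 falls outside every bin and becomes NaN.
default_cut = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True makes the first interval include its left edge.
inclusive_cut = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

# The result is categorical: add "Unknown" as a category, then fill.
filled = inclusive_cut.cat.add_categories("Unknown").fillna("Unknown")

print(default_cut.tolist())
print(filled.tolist())
```

An alternative to `add_categories` is to convert the binned column to plain objects with `astype(object)` before filling, which is simpler when the column is going straight into one-hot encoding anyway.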
The resulting feature tables can be used to train a machine learning model.
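For a model to use both tables, they generally need the same columns in the same order. One-hot encoding the training and test sets separately does not guarantee this, because a category that appears in only one set produces a column in only that table. The sketch below uses two small made-up frames (`train` and `test` are illustrative, not the real files) and shows one possible way to align them with `DataFrame.reindex`.

```python
import pandas as pd

# Hypothetical toy data: the test set never contains embarked == "C" or "Q".
train = pd.DataFrame({"embarked": ["S", "C", "Q"],
                      "sex": ["male", "female", "male"]})
test = pd.DataFrame({"embarked": ["S", "S"],
                     "sex": ["female", "male"]})

# Encode each set on its own, as the baseline code does.
train_features = pd.get_dummies(train, columns=["embarked", "sex"], dtype=int)
test_features = pd.get_dummies(test, columns=["embarked", "sex"], dtype=int)

# Columns present in the training table but absent from the test table.
missing = sorted(set(train_features.columns) - set(test_features.columns))

# Align the test table to the training columns: missing columns are added
# and filled with 0, and any test-only columns would be dropped.
test_aligned = test_features.reindex(columns=train_features.columns, fill_value=0)

print(missing)
print(list(test_aligned.columns))
```

Aligning to the training columns is a design choice: categories seen only in the test set are discarded, which matches what a model trained on the training table could use.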