目标编码 def gen_target_encoding_feats(train, train_2, test, encode_cols, target_col, n_fold=10): '''生成target encoding特征''' # for training set - cv tg_feats = np.zeros((train.shape[0], len(encode_cols))) kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True) for _, (train_index, val_index) in enumerate(kfold.split(train[encode_cols], train[target_col])): df_train, df_val = train.iloc[train_index], train.iloc[val_index] for idx, col in enumerate(encode_cols): target_mean_dict = df_train.groupby(col)[target_col].mean() if not df_val[f'{col}_mean_target'].empty: df_val[f'{col}_mean_target'] = df_val[col].map(target_mean_dict) tg_feats[val_index, idx] = df_val[f'{col}_mean_target'].values for idx, encode_col in enumerate(encode_cols): train[f'{encode_col}_mean_target'] = tg_feats[:, idx] # for train_2 set - cv tg_feats = np.zeros((train_2.shape[0], len(encode_cols))) kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True) for _, (train_index, val_index) in enumerate(kfold.split(train_2[encode_cols], train_2[target_col])): df_train, df_val = train_2.iloc[train_index], train_2.iloc[val_index] for idx, col in enumerate(encode_cols): target_mean_dict = df_train.groupby(col)[target_col].mean() if not df_val[f'{col}_mean_target'].empty: df_val[f'{col}_mean_target'] = df_val[col].map(target_mean_dict) tg_feats[val_index, idx] = df_val[f'{col}_mean_target'].values for idx, encode_col in enumerate(encode_cols): train_2[f'{encode_col}_mean_target'] = tg_feats[:, idx] # for testing set for col in encode_cols: target_mean_dict = train.groupby(col)[target_col].mean() test[f'{col}_mean_target'] = test[col].map(target_mean_dict) return train, train_2, test features = ['house_exist', 'debt_loan_ratio', 'industry', 'title'] train_1, train_2, test = gen_target_encoding_feats(train_1, train_2, test, features, ['isDefault'], n_fold=10) 出现报错
时间: 2024-01-26 19:02:25 浏览: 153
根据报错信息,可能是因为在验证集或测试集中有分类变量的某个值在训练集中没有出现,导致无法进行目标编码特征的计算。具体来说,可能是在计算`df_val[f'{col}_mean_target']`时出现了KeyError,即某个分类变量在验证集中出现了训练集中没有出现的值。
为了解决这个问题,你可以在计算目标编码特征之前,先检查验证集和测试集中的分类变量值是否都在训练集中出现过。你可以在循环开始之前使用`set()`函数获取训练集中某一列的所有可能值,然后在循环中判断验证集和测试集中该列的值是否都在这个集合中出现过,如果有值没有出现过,则可以将其替换为训练集中该列的众数或其他默认值。
具体来说,你可以在`for idx, col in enumerate(encode_cols):`循环开始之前,加入以下代码:
```python
# get all possible values for the current column
col_values = set(train[col].unique())
if None in col_values:
col_values.remove(None)
# replace value with mode if it does not appear in the training set
mode = train[col].mode()[0]
df_val.loc[~df_val[col].isin(col_values), f'{col}_mean_target'] = mode
test.loc[~test[col].isin(col_values), f'{col}_mean_target'] = mode
```
这样就可以避免由于验证集或测试集中某个分类变量的某个值在训练集中没有出现而导致的KeyError报错。
阅读全文