Code sample for the Don't Overfit II competition
Date: 2023-05-28 07:04:05
Below is a simple Python code sample that builds a logistic regression model for Kaggle's "Don't Overfit II" competition.
```
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# Load data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
# Split data into X and y
X_train = train_df.drop(['id', 'target'], axis=1)
y_train = train_df['target']
X_test = test_df.drop('id', axis=1)
# Build logistic regression model
model = LogisticRegression(random_state=0, solver='lbfgs', max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
print("Mean AUC: {:.4f}".format(np.mean(scores)))
# Fit model to full training set
model.fit(X_train, y_train)
# Predict on test set and save submission file
y_pred = model.predict_proba(X_test)[:, 1]
submission_df = pd.DataFrame({'id': test_df['id'], 'target': y_pred})
submission_df.to_csv('submission.csv', index=False)
```
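In this sample, we first load the training and test data from CSV files. We split the training data into features (X_train) and the target variable (y_train), and drop the id column from the test data to get X_test. Next, we build a logistic regression model and use 5-fold cross-validation to compute the mean AUC. Finally, we fit the model on the full training set, predict the positive-class probability for each test row, and save the predictions as a CSV file for submission.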
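If you want to try the same steps without downloading the competition files, the sketch below runs them on synthetic data. It is only an illustration under stated assumptions: `make_fake_frames` and `run_baseline` are hypothetical helper names, the row and column counts are arbitrary, and the data comes from scikit-learn's `make_classification`, not from the real `train.csv` and `test.csv`. The AUC it prints says nothing about the actual competition.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def make_fake_frames(n_train=250, n_test=100, n_features=20, seed=0):
    """Hypothetical stand-in for train.csv / test.csv (sizes are arbitrary)."""
    X, y = make_classification(
        n_samples=n_train + n_test,
        n_features=n_features,
        n_informative=5,
        random_state=seed,
    )
    cols = [str(i) for i in range(n_features)]

    # Training frame: id, target, then the feature columns
    train = pd.DataFrame(X[:n_train], columns=cols)
    train.insert(0, "target", y[:n_train])
    train.insert(0, "id", np.arange(n_train))

    # Test frame: id and feature columns only (no target)
    test = pd.DataFrame(X[n_train:], columns=cols)
    test.insert(0, "id", np.arange(n_train, n_train + n_test))
    return train, test


def run_baseline(train_df, test_df):
    """Same steps as the sample above: split, cross-validate, fit, predict."""
    X_train = train_df.drop(["id", "target"], axis=1)
    y_train = train_df["target"]
    X_test = test_df.drop("id", axis=1)

    model = LogisticRegression(random_state=0, solver="lbfgs", max_iter=1000)

    # 5-fold cross-validated AUC on the training data
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")

    # Refit on all training rows, then predict P(target == 1) for the test rows
    model.fit(X_train, y_train)
    y_pred = model.predict_proba(X_test)[:, 1]

    submission = pd.DataFrame({"id": test_df["id"], "target": y_pred})
    return scores, submission


train_df, test_df = make_fake_frames()
scores, submission_df = run_baseline(train_df, test_df)
print("Mean AUC: {:.4f}".format(np.mean(scores)))
print(submission_df.head())
```

To run it on the real data, replace the `make_fake_frames()` call with the two `pd.read_csv` calls from the sample above and write `submission_df` out with `to_csv('submission.csv', index=False)`.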