Complete this code
Posted: 2024-02-12 09:06:32 Views: 54
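Sure, I can help you complete the code. Training a model and making predictions usually involves cross-validation, grid search, and model evaluation, which together give better and more trustworthy performance. Here is a complete code example: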
``` python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Load the data
data = pd.read_csv("weather.csv")
# Preprocessing: one-hot encode the 'week' column.
# sparse_output requires scikit-learn >= 1.2; older releases spell it sparse=False.
encoder = OneHotEncoder(sparse_output=False)
week_encoded = encoder.fit_transform(data[['week']])
# The encoder sorts its categories, so take the column names from the encoder
# instead of hard-coding Monday..Sunday, which would mislabel the columns.
week_columns = list(encoder.get_feature_names_out(['week']))
week_encoded = pd.DataFrame(week_encoded, columns=week_columns, index=data.index)
data = pd.concat([data, week_encoded], axis=1)
# Feature selection
features = ['year', 'month', 'day', 'temp_2', 'temp_1', 'average'] + week_columns
X = data[features]
y = data['actual']
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Decision tree model: 5-fold cross-validation on the training set
dt_model = DecisionTreeRegressor(random_state=42)
dt_scores = cross_val_score(dt_model, X_train, y_train, cv=5, scoring='r2')
dt_mean_score = np.mean(dt_scores)
print("Decision tree cross-validation scores:", dt_scores)
print("Decision tree mean cross-validation score:", dt_mean_score)
# Random forest model: 5-fold cross-validation on the training set
rf_model = RandomForestRegressor(random_state=42)
rf_scores = cross_val_score(rf_model, X_train, y_train, cv=5, scoring='r2')
rf_mean_score = np.mean(rf_scores)
print("Random forest cross-validation scores:", rf_scores)
print("Random forest mean cross-validation score:", rf_mean_score)
# Grid search for hyperparameter tuning (3 * 3 * 3 * 3 = 81 combinations, 5 folds each)
params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(rf_model, params, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
best_score = grid_search.best_score_
print("Best random forest parameters:", best_params)
print("Best random forest cross-validation score:", best_score)
# Train a model with the best parameters and evaluate it on the test set
best_rf_model = RandomForestRegressor(**best_params, random_state=42)
best_rf_model.fit(X_train, y_train)
y_pred = best_rf_model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print("Random forest MAE:", mae)
print("Random forest MSE:", mse)
print("Random forest RMSE:", rmse)
print("Random forest R2 score:", r2)
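# Predict the maximum temperature for a single day.
# The values follow the order of `features`; the single 1 among the last seven
# values must sit in the one-hot column that matches the intended weekday.
sample = pd.DataFrame([[2022, 9, 14, 23, 25, 28, 0, 0, 0, 0, 1, 0, 0]], columns=features)
prediction = best_rf_model.predict(sample)
print("Predicted maximum temperature:", prediction[0])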
```
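This code first imports the required libraries and loads the dataset, then one-hot encodes the week column, turning it into seven binary columns. It selects several features as the input X and uses the actual maximum temperature as the target y. The train_test_split function divides the data into a training set and a test set. The decision tree and the random forest are each scored with 5-fold cross-validation on the training set, and the random forest is then tuned with a grid search, which reports the best parameters and the best cross-validation score. Finally, the tuned random forest predicts on the test set and its performance is evaluated.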
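One detail of the one-hot step is worth checking: with its default settings, OneHotEncoder orders the output columns by sorted category value, not by calendar order. The sketch below mimics that ordering in plain Python so it runs without scikit-learn. The weekday labels are an assumption for illustration, since the actual values in weather.csv are not shown, and `one_hot` is a hypothetical helper, not a library function.

```python
# Hypothetical weekday labels; the real values in weather.csv may differ.
week_values = ["Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun", "Mon"]

# Like OneHotEncoder with its default categories="auto", derive the
# categories as the sorted unique values of the column.
categories = sorted(set(week_values))


def one_hot(value, categories):
    """Return a 0/1 indicator list with a single 1 at the value's category."""
    return [1 if value == category else 0 for category in categories]


encoded_rows = [one_hot(value, categories) for value in week_values]

print(categories)       # ['Fri', 'Mon', 'Sat', 'Sun', 'Thurs', 'Tues', 'Wed']
print(encoded_rows[0])  # 'Mon' -> [0, 1, 0, 0, 0, 0, 0]
```

With these labels the first column stands for Friday, not Monday, so column names written in calendar order would not match the data. In scikit-learn, `encoder.categories_` and `encoder.get_feature_names_out()` report the order the encoder actually used.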
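Note that the evaluation reports four metrics: mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and the coefficient of determination (R2 score). You can choose among them according to your needs.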
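To make the differences between these metrics concrete, here is a minimal sketch that computes all four from their standard definitions in plain Python. The function name `regression_metrics` and the sample temperatures are made up for illustration, and the sketch assumes `y_true` is not constant, since R2 would otherwise divide by zero.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R2 from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # same unit as the target

    # R2 compares the model's squared error with that of always
    # predicting the mean of y_true.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Made-up temperatures, for illustration only.
y_true = [20.0, 22.0, 25.0, 21.0]
y_pred = [21.0, 22.0, 23.0, 22.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MAE and RMSE are in the same unit as the target, so they are the easiest to read as "degrees off". RMSE penalizes large errors more heavily than MAE, and R2 is unitless, which makes it convenient for comparing models.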