Next, use the Pearson coefficient to find the 3-4 features with the greatest influence on photovoltaic (PV) power output, then redo items 3-5 above
Time: 2024-09-19 11:04:57
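The Pearson coefficient measures the strength of the linear relationship between two variables, so it is suited to analyzing associations between numeric features. To find the 3 to 4 features that most strongly affect PV power output, we can proceed as follows: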
1. Compute the Pearson coefficient between each feature and the power output:
```python
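import pandas as pd
from scipy.stats import pearsonr

def calculate_pearson_coefficients(data, target_column, top_n=3):
    """Return the top_n (feature name, r) pairs ranked by |r| against the target."""
    correlations = {}
    # Skip the target itself; the remaining columns are assumed to be numeric
    for column in data.columns.drop(target_column):
        corr, _ = pearsonr(data[column], data[target_column])
        correlations[column] = corr
    # Rank by absolute value so strong negative correlations also count
    return sorted(correlations.items(), key=lambda x: abs(x[1]), reverse=True)[:top_n]

# Set top_n to 3 or 4 depending on how many features you want to keep
top_features = calculate_pearson_coefficients(df, "光伏功率", top_n=3)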
```
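To make the ranking step concrete, here is a minimal sketch on synthetic data. It spells out the Pearson formula by hand and then ranks features with pandas' `corrwith`, which is an alternative to looping over `pearsonr`. The column names (`irradiance`, `temperature`, `humidity`, `wind_speed`, `power`) and the coefficients used to generate the data are made up for illustration and are not taken from the original dataset.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson r by definition: covariance over the product of standard deviations."""
    dx = np.asarray(x, dtype=float) - np.mean(x)
    dy = np.asarray(y, dtype=float) - np.mean(y)
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Synthetic stand-in data; column names and coefficients are hypothetical
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "irradiance": rng.uniform(0, 1000, n),
    "temperature": rng.uniform(-5, 35, n),
    "humidity": rng.uniform(20, 90, n),
    "wind_speed": rng.uniform(0, 15, n),
})
demo["power"] = 0.1 * demo["irradiance"] - 0.3 * demo["temperature"] + rng.normal(0, 2, n)

# corrwith computes Pearson r (its default method) for every feature column at once
r = demo.drop(columns="power").corrwith(demo["power"])

# Reorder by absolute value, strongest first, and keep the top three names
ranked = r.reindex(r.abs().sort_values(ascending=False).index)
top_names = ranked.index[:3].tolist()
print(ranked)
```

Keep in mind that Pearson only captures linear association, so a feature with a strong nonlinear effect on power can still receive a low score.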
2. Pick the important features and split the dataset again:
```python
from sklearn.model_selection import train_test_split
selected_columns = [name for name, _ in top_features]  # entries are (column name, r) pairs
X_train_selected, X_test_selected, y_train, y_test = train_test_split(df[selected_columns], df["光伏功率"], test_size=0.2, random_state=42)
```
3. Rebuild the model using only the important features:
```python
from sklearn.linear_model import LinearRegression
reduced_model = LinearRegression()  # or try another model such as Lasso or Ridge
reduced_model.fit(X_train_selected, y_train)
```
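Since Lasso and Ridge are mentioned as alternatives, the sketch below shows one way to compare them with plain linear regression on the same split. It runs on synthetic data standing in for the three selected features, and the `alpha` values are arbitrary placeholders rather than tuned settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for three selected features and a target
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = X @ np.array([5.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# alpha values are illustrative assumptions, not tuned
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
}

# Fit each candidate on the training split and score it on the test split
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: test R^2 = {scores[name]:.3f}")
```

With only 3 to 4 features, regularization usually changes little; it tends to matter more when many correlated features are kept.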
4. Re-evaluate the model:
```python
from sklearn.metrics import r2_score

# R^2 on the training set
y_pred_train_reduced = reduced_model.predict(X_train_selected)
score_train_reduced = r2_score(y_train, y_pred_train_reduced)
print(f"Training R^2 score with selected features: {score_train_reduced}")

# R^2 on the held-out test set
y_pred_test_reduced = reduced_model.predict(X_test_selected)
score_test_reduced = r2_score(y_test, y_pred_test_reduced)
print(f"Testing R^2 score with selected features: {score_test_reduced}")
```
5. Plot predicted versus actual values:
```python
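import matplotlib.pyplot as plt
# Plot the predictions made with the newly selected features
plt.scatter(y_test, y_pred_test_reduced, alpha=0.5)
# ... (same as the earlier plotting step, using y_pred_test_reduced in place of y_pred_test)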
```
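The earlier plotting step is not reproduced here, so the following is only a sketch of what a complete predicted-versus-actual plot might look like. The arrays `actual` and `predicted` are synthetic stand-ins for `y_test` and `y_pred_test_reduced`, and the axis labels, title, and reference line are assumptions rather than the original styling.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for y_test and y_pred_test_reduced
rng = np.random.default_rng(0)
actual = rng.uniform(0, 100, 150)
predicted = actual + rng.normal(0, 5, 150)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(actual, predicted, alpha=0.5, label="predictions")

# Points lying on this diagonal would be perfect predictions
lims = [min(actual.min(), predicted.min()), max(actual.max(), predicted.max())]
ax.plot(lims, lims, "r--", label="y = x")

ax.set_xlabel("Actual PV power")
ax.set_ylabel("Predicted PV power")
ax.set_title("Predicted vs. actual (selected features)")
ax.legend()
fig.tight_layout()
# Call plt.show() interactively, or fig.savefig(...) to write the figure to disk
```

The tighter the points cluster around the dashed line, the closer the reduced model's predictions are to the measured values.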