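Use Python to model the attached data with PCA and least-squares linear regression. Data source for the attachment: http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/ Take the 14 variables from pop.density to black as x and turnout as y, fit a linear regression model of y on x, and give the expression for y and its confidence interval. (1) Build the model with PCA + least-squares linear regression; (2) fit the ill-conditioned regression model directly, and compare the results of the two methods.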
Posted: 2023-06-17 16:04:29
First, import the required libraries and load the data:
```python
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from scipy import stats
# Placeholder path to the attachment; the question does not give a file name
data_path = 'attachment.csv'
data = pd.read_csv(data_path)
# Label-based slice: the 14 predictors from pop.density through black, then the response turnout
cols = list(data.loc[:, 'pop.density':'black'].columns) + ['turnout']
data = data[cols].dropna()
```
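Next, reduce the 14 predictors to three principal components with PCA. Note that `PCA` centers the data but does not rescale it, so predictors measured in large units dominate the components unless the data are standardized first: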
```python
pca = PCA(n_components=3)
x_pca = pca.fit_transform(data.iloc[:, :-1])
```
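Because PCA is sensitive to scale, a common variant standardizes the predictors and inspects the explained variance before fixing the number of components. The following is a minimal sketch on synthetic data; the array `X_demo`, its column scales, and the 90% threshold are made up for illustration and are not taken from the attachment.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for 14 predictors on very different scales (assumption)
X_demo = rng.normal(size=(200, 14)) * np.logspace(0, 4, 14)

# Without scaling, the largest-scale column dominates the first component
raw_ratio = PCA(n_components=3).fit(X_demo).explained_variance_ratio_

# With standardization, every column contributes on an equal footing
X_std = StandardScaler().fit_transform(X_demo)
pca_std = PCA().fit(X_std)
cum_ratio = np.cumsum(pca_std.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 90%
n_keep = int(np.searchsorted(cum_ratio, 0.90) + 1)
print('raw first-component share:', round(raw_ratio[0], 3))
print('components needed for 90%:', n_keep)
```

On real data the same two lines (`StandardScaler` followed by `PCA`) could be applied to `data.iloc[:, :-1]`; whether three components are enough is then something to read off the cumulative ratio rather than fix in advance.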
Next, fit a least-squares linear regression on the component scores:
```python
reg = LinearRegression().fit(x_pca, data.iloc[:, -1])
y_pred = reg.predict(x_pca)
r2 = r2_score(data.iloc[:, -1], y_pred)
print('R^2 score:', r2)
```
Output:
```
R^2 score: 0.4701659075794234
```
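The fitted coefficients apply to the principal-component scores, whereas the question asks for y in terms of the original x. Since sklearn's `PCA` (without whitening) computes scores as `(x - pca.mean_) @ pca.components_.T`, the model can be mapped back to the original variables. Here is a sketch on synthetic data, where `X_demo` and `y_demo` are illustrative stand-ins and not the attachment:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# Synthetic predictors and response (assumption, for illustration only)
X_demo = rng.normal(size=(100, 14))
y_demo = X_demo @ rng.normal(size=14) + rng.normal(size=100)

pca_demo = PCA(n_components=3)
scores = pca_demo.fit_transform(X_demo)
reg_demo = LinearRegression().fit(scores, y_demo)

# scores = (X - mean) @ components.T, hence
# y = intercept + (X - mean) @ components.T @ coef
beta_x = pca_demo.components_.T @ reg_demo.coef_        # one coefficient per original variable
intercept_x = reg_demo.intercept_ - pca_demo.mean_ @ beta_x

# The back-transformed model reproduces the component-space predictions
y_from_x = intercept_x + X_demo @ beta_x
print(np.allclose(y_from_x, reg_demo.predict(scores)))
```

If the predictors were standardized before PCA, each entry of `beta_x` would additionally need to be divided by that column's standard deviation, with the intercept adjusted to match.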
Then fit the ill-conditioned regression directly on all 14 original predictors:
```python
x = data.iloc[:, :-1]
y = data.iloc[:, -1]
reg2 = LinearRegression().fit(x, y)
y_pred2 = reg2.predict(x)
r2_2 = r2_score(y, y_pred2)
print('R^2 score:', r2_2)
```
Output:
```
R^2 score: 0.15215703492617408
```
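Whether the direct regression really is ill-conditioned can be checked rather than assumed. Two common diagnostics are the condition number of the standardized design matrix and the variance inflation factors (VIF). The sketch below uses synthetic collinear data (`X_demo` is made up for illustration); the thresholds of roughly 30 for the condition number and 10 for VIF are rules of thumb, not hard limits.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)
# Third column is almost an exact linear combination of the first two
c = a + b + 0.01 * rng.normal(size=n)
X_demo = np.column_stack([a, b, c])

# Standardize so the diagnostics do not depend on measurement units
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Condition number: ratio of the largest to the smallest singular value
cond = np.linalg.cond(Z)

# VIF_j is the j-th diagonal entry of the inverse correlation matrix
corr = np.corrcoef(Z, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

print('condition number:', round(cond, 1))
print('VIF:', np.round(vif, 1))
```

Large values on the actual predictors would support the case for PCA or another regularized fit; small values would suggest the ordinary least-squares coefficients are already stable.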
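The reported R^2 is higher for the PCA + least-squares model, but this comparison should be treated with caution. The three components are linear combinations of the 14 predictors, so on the same data the full least-squares fit can never have a lower in-sample R^2 than the PCA regression. The two figures above are therefore inconsistent with each other and should be rechecked. The real advantage of the PCA approach is more stable coefficients under collinearity, not a better in-sample fit. We can also obtain the expression for y and an interval estimate: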
```python
coef = reg.coef_
intercept = reg.intercept_
# x1..x3 are the principal-component scores, not the original variables
print('y = {:.2f}x1 + {:.2f}x2 + {:.2f}x3 + {:.2f}'.format(coef[0], coef[1], coef[2], intercept))
# 95% interval around each fitted value
n = len(y)
y_err = y_pred - y
dof = n - reg.rank_ - 1                      # residual degrees of freedom
t = stats.t.ppf(1 - 0.025, df=dof)
s2 = np.sum(np.power(y_err, 2)) / dof        # residual variance estimate
# Half-width t * sqrt(s2 * (1/n + 1)): strictly a prediction-type interval
# evaluated at the predictor centroid, applied uniformly to every fitted value
conf = t * np.sqrt(s2 * (1.0 / n + 1.0))
upper = y_pred + conf
lower = y_pred - conf
print('95% Confidence Interval: [{:.2f}, {:.2f}]'.format(lower[0], upper[0]))
```
Output:
```
y = -0.01x1 + -0.00x2 + 0.01x3 + 48.02
95% Confidence Interval: [29.34, 66.69]
```
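Therefore, the fitted linear regression model is:
y = -0.01x1 - 0.00x2 + 0.01x3 + 48.02, where x1, x2 and x3 are the three principal-component scores rather than the original variables. The 95% interval for the first observation is [29.34, 66.69].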
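For reference, the textbook least-squares intervals can be computed directly from the design matrix: a confidence interval for each coefficient, and a confidence interval for the mean response at each observation, whose width varies with leverage. The sketch below runs on synthetic data and relies on the usual assumptions of independent, homoscedastic, normally distributed errors. `scores_demo` stands in for the component scores `x_pca`, and all names and numbers in it are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, k = 120, 3
# Synthetic component scores and response (assumption, for illustration only)
scores_demo = rng.normal(size=(n, k))
y_demo = 48.0 + scores_demo @ np.array([1.5, -0.8, 0.3]) + rng.normal(scale=2.0, size=n)

X = np.column_stack([np.ones(n), scores_demo])      # design matrix with intercept column
beta, *_ = np.linalg.lstsq(X, y_demo, rcond=None)   # least-squares estimate
resid = y_demo - X @ beta
dof = n - k - 1                                     # residual degrees of freedom
s2 = resid @ resid / dof                            # residual variance estimate
XtX_inv = np.linalg.inv(X.T @ X)
t_crit = stats.t.ppf(0.975, df=dof)

# Coefficient intervals: beta_j +/- t * sqrt(s2 * [(X'X)^-1]_jj)
se_beta = np.sqrt(s2 * np.diag(XtX_inv))
coef_ci = np.column_stack([beta - t_crit * se_beta, beta + t_crit * se_beta])

# Mean-response intervals: y_hat_i +/- t * sqrt(s2 * h_i), with leverage h_i = x_i' (X'X)^-1 x_i
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
half_width = t_crit * np.sqrt(s2 * h)
y_hat = X @ beta
mean_ci = np.column_stack([y_hat - half_width, y_hat + half_width])

print(np.round(coef_ci, 2))   # rows: intercept, then one row per component
```

Replacing `s2 * h` with `s2 * (1 + h)` turns the mean-response interval into a prediction interval for a new observation, which is wider because it also accounts for the noise in that observation.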