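Write Python code for the following tasks.
① Read binary.csv, where gpa is the applicant's grade point average, gre is the GRE score, rank is the prestige of the applicant's undergraduate institution, and admit is the admission decision. Rename rank to prestige. Show the descriptive statistics, the standard deviation of each column, and the counts of admit values at each prestige level, and plot a histogram of each variable.
② Using statsmodels.formula, treat prestige as a categorical variable, fit a model with smf, and store the predicted score in a predict column. Treat a predicted value > 0.5 as a predicted admission, then compute the number of predicted admissions, the number of actual admissions, and the hit rate. Do this with both logit and ols and compare the results.
③ Using sklearn, convert prestige to dummy variables. In addition to gre and gpa, include the dummy variables (note that the number of dummy columns included should be the total number of dummy columns minus 1, with the dropped column serving as the baseline). Predict with sklearn, where the predicted admit is a binary 0/1 value, and compute the number of predicted admissions, the number of actual admissions, and the hit rate. Do this with both LogisticRegression and LogisticRegressionCV (with cv=10) and compare the results.
④ Use train_test_split to split the data randomly into a training set and a test set, with 1/3 held out as the test set. Repeat logit, ols, LogisticRegression and LogisticRegressionCV and compare their performance.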
Posted: 2024-02-29 15:51:53
The code below implements these steps using pandas, numpy, matplotlib, statsmodels and sklearn:
```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
# Read the data and rename rank to prestige
data = pd.read_csv('binary.csv')
data = data.rename(columns={'rank': 'prestige'})
# Descriptive statistics
print(data.describe())
# Standard deviation of each column
print("Standard deviation of each column:")
print(data.std())
# Counts of admit values at each prestige level
print("Prestige vs. admit:")
print(data.groupby('prestige')['admit'].value_counts())
# Histogram of every column
data.hist()
plt.show()
# Fit with statsmodels; C(prestige) treats prestige as a categorical variable
model = smf.logit('admit ~ gre + gpa + C(prestige)', data=data).fit()
data['predict'] = model.predict(data)  # predicted probability of admission
predicted = (data['predict'] > 0.5).astype(int)  # a score above 0.5 counts as a predicted admission
print("Logit results:")
print("Predicted admit count:", predicted.sum())
print("Actual admit count:", data['admit'].sum())
print("Hit rate:", (predicted == data['admit']).mean())
# Same formula with OLS (a linear probability model), for comparison
model = smf.ols('admit ~ gre + gpa + C(prestige)', data=data).fit()
data['predict'] = model.predict(data)  # fitted value, used as the predicted score
predicted = (data['predict'] > 0.5).astype(int)
print("OLS results:")
print("Predicted admit count:", predicted.sum())
print("Actual admit count:", data['admit'].sum())
print("Hit rate:", (predicted == data['admit']).mean())
# Predict with sklearn: one-hot encode prestige, dropping the first level as the baseline
enc = OneHotEncoder(drop='first')
prestige_dummy = pd.DataFrame(enc.fit_transform(data[['prestige']]).toarray())
prestige_dummy.columns = ['prestige_2', 'prestige_3', 'prestige_4']
X = pd.concat([data[['gre', 'gpa']], prestige_dummy], axis=1)
y = data['admit']
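# LogisticRegression (max_iter is raised because gre and gpa are unscaled and the default 100 iterations may not converge)
model = LogisticRegression(max_iter=1000).fit(X, y)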
data['predict'] = model.predict(X)
print("LogisticRegression results:")
print("Predicted admit count:", data['predict'].sum())
print("Actual admit count:", data['admit'].sum())
print("Hit rate:", sum(data['predict'] == data['admit'])/len(data))
# LogisticRegressionCV with 10-fold cross-validation (max_iter raised for the same unscaled features)
model = LogisticRegressionCV(cv=10, max_iter=1000).fit(X, y)
data['predict'] = model.predict(X)
print("LogisticRegressionCV results:")
print("Predicted admit count:", data['predict'].sum())
print("Actual admit count:", data['admit'].sum())
print("Hit rate:", sum(data['predict'] == data['admit'])/len(data))
# Randomly split into training and test sets, holding out 1/3 for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=42)
# LogisticRegression
model = LogisticRegression().fit(X_train, y_train)
y_predict = model.predict(X_test)
print("LogisticRegression test results:")
print("Predicted admit count:", y_predict.sum())
print("Actual admit count:", y_test.sum())
print("Hit rate:", sum(y_predict == y_test)/len(y_test))
# LogisticRegressionCV
model = LogisticRegressionCV(cv=10).fit(X_train, y_train)
y_predict = model.predict(X_test)
print("LogisticRegressionCV test results:")
print("Predicted admit count:", y_predict.sum())
print("Actual admit count:", y_test.sum())
print("Hit rate:", sum(y_predict == y_test)/len(y_test))
# Refit the statsmodels models on the training split only.
# X_train holds the dummy columns rather than prestige, so the formula names them directly.
formula = 'admit ~ gre + gpa + prestige_2 + prestige_3 + prestige_4'
train = pd.concat([X_train, y_train], axis=1)
model = smf.logit(formula, data=train).fit()
y_predict = np.where(model.predict(X_test) > 0.5, 1, 0)
print("Logit test results:")
print("Predicted admit count:", y_predict.sum())
print("Actual admit count:", y_test.sum())
print("Hit rate:", sum(y_predict == y_test)/len(y_test))
model = smf.ols(formula, data=train).fit()
y_predict = np.where(model.predict(X_test) > 0.5, 1, 0)
print("OLS test results:")
print("Predicted admit count:", y_predict.sum())
print("Actual admit count:", y_test.sum())
print("Hit rate:", sum(y_predict == y_test)/len(y_test))
```
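The three figures printed for every model (predicted admissions, actual admissions, hit rate) could be gathered into one helper, which makes the models easier to compare side by side. The sketch below is only an illustration: the function name `summarize_predictions`, the extra `baseline_hit_rate` field (the hit rate of always guessing the more common class), and the toy values are assumptions, not part of the original answer or of binary.csv.

```python
import numpy as np

def summarize_predictions(y_true, scores, threshold=0.5):
    """Cut scores at `threshold` and report counts and hit rate (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(scores) > threshold).astype(int)
    # Share of the more common class: the hit rate of a constant guess
    majority_share = max(y_true.mean(), 1 - y_true.mean())
    return {
        "predicted_admits": int(y_pred.sum()),
        "actual_admits": int(y_true.sum()),
        "hit_rate": float((y_pred == y_true).mean()),
        "baseline_hit_rate": float(majority_share),
    }

# Made-up toy values, not taken from binary.csv
toy_truth = [0, 1, 0, 0, 1, 0]
toy_scores = [0.2, 0.7, 0.6, 0.1, 0.4, 0.3]
summary = summarize_predictions(toy_truth, toy_scores)
print(summary)
```

A hit rate that is not clearly above `baseline_hit_rate` suggests the model adds little beyond always predicting the majority outcome, which is worth checking when comparing logit, OLS and the sklearn models.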
Note: this code is for reference only; adapt it to your actual data and requirements.
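One possible adjustment to the dummy-variable step is `pandas.get_dummies` with `drop_first=True`, which also drops the first level as the baseline and names the columns itself, so the column names need not be typed by hand. This is a sketch of an alternative to `OneHotEncoder`, not the method used in the answer; the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Made-up toy rows with the same column names as binary.csv after the rename
toy = pd.DataFrame({
    "gre": [500, 600, 700, 650],
    "gpa": [3.0, 3.5, 3.8, 3.2],
    "prestige": [3, 1, 2, 4],
})

# drop_first=True removes the prestige_1 column, leaving it as the baseline level
dummies = pd.get_dummies(toy["prestige"], prefix="prestige", drop_first=True, dtype=int)
X_toy = pd.concat([toy[["gre", "gpa"]], dummies], axis=1)
print(X_toy.columns.tolist())
```

A row whose prestige is 1 ends up with zeros in all three dummy columns, which is what makes level 1 the reference category.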