Write Python code that reads the UCI Pima Indians Diabetes dataset (diabetes.csv) and models it with a decision tree (advanced group)
Posted: 2024-11-09 22:26:47
Pima Indians Diabetes dataset
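First, install the required libraries: pandas for data handling, matplotlib for visualization, and scikit-learn for building the decision tree model. If they are not installed yet, use pip: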
```bash
pip install pandas matplotlib scikit-learn
```
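Once the libraries are installed, a quick check that the data file loads as expected can save debugging later. The following is a minimal sketch, assuming a local `diabetes.csv` whose first row is a header and whose last column is the 0/1 label. The helper name `load_diabetes_csv` and this file layout are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd


def load_diabetes_csv(path):
    """Load a local copy of the dataset and split it into features and target.

    Assumption: the file has a header row and the last column holds
    the 0/1 diabetes label.
    """
    data = pd.read_csv(path)
    X = data.iloc[:, :-1]  # every column except the last is a feature
    y = data.iloc[:, -1]   # the last column is the target
    return X, y
```

Calling `load_diabetes_csv("diabetes.csv")` returns a feature DataFrame and a target Series; inspecting `X.shape` and `y.value_counts()` is a simple way to confirm the file was parsed as intended.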
Next, here is a Python snippet that shows how to read the CSV file and model the diabetes dataset with a decision tree:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
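# 1. Load the data
# Note: this URL may no longer be served by UCI. For a local copy such as diabetes.csv
# that has its own header row, use pd.read_csv("diabetes.csv", names=column_names, header=0).
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
column_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']  # column names
data = pd.read_csv(url, names=column_names)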
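# 2. Preprocessing
# All eight features are numeric, so no categorical (one-hot) encoding is needed here;
# the last column, 'class', is the target variable
X = data.iloc[:, :-1]  # features
y = data['class']  # target variable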
# 3. Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 80% train, 20% test
# 4. Build the decision tree model
clf = DecisionTreeClassifier(random_state=42)  # fixed seed so the fitted tree is reproducible
clf.fit(X_train, y_train)
# 5. Predict and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the test set: {accuracy * 100:.2f}%")
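# 6. Visualize the decision tree (optional; shown only as an example)
# graph.view() also needs the Graphviz system binaries, not just the Python package.
try:
    import graphviz
    from sklearn.tree import export_graphviz

    dot_data = export_graphviz(clf, out_file=None, feature_names=list(X.columns),
                               filled=True, rounded=True, special_characters=True)
    graph = graphviz.Source(dot_data)
    graph.view()
except ImportError:
    print("You need to install graphviz for visualizing the decision tree.")
    print("To install, use `pip install graphviz` and then run this code again.")
```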