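Training a ridge regression model on the diabetes dataset in machine learning: 1. Import the diabetes dataset (code and result screenshots) 1.1 Inspect the dataset fields 1.2 Inspect the dataset distribution 1.3 Missing-value detection 2. Split the dataset into training and test sets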
Date: 2024-06-06 19:06:49
1. Import the diabetes dataset
1.1 Inspect the dataset fields
```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
print(diabetes.DESCR)
print(diabetes.feature_names)
```
Output:
```
Diabetes dataset
Notes
-----
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
Data Set Characteristics:
:Number of Instances: 442
:Number of Attributes: 10 numeric predictive attributes and the target
:Attribute Information:
- age age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, T-Cells (a type of white blood cells)
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, thyroid stimulating hormone
- s5 ltg, lamotrigine
- s6 glu, blood sugar level
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attribute Information: None
:Missing Attribute Values: None
:Creator: Dr. Bradley Efron
This is a copy of the diabetes data set from UCI ML repository.
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
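Note: the sex field takes two distinct values. As returned by load_diabetes, all features are mean-centered and scaled, so it is not coded as 0/1.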
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
```
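To view the fields as a table, the arrays can be wrapped in a pandas DataFrame. The sketch below is an illustration, not part of the original answer; the names `df`, `sex_values`, and `col_sq_sums` are assumptions introduced here. It also checks how `sex` is actually encoded: with default arguments, `load_diabetes` returns features that are mean-centered and scaled so that each column's sum of squares is about 1, so `sex` appears as two scaled values, not 0/1. The field descriptions printed by `DESCR` may differ between scikit-learn versions (newer releases describe s1, s4, and s5 as cholesterol and triglyceride measures).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()

# Wrap the feature matrix in a DataFrame and append the target column
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df["target"] = diabetes.target

print(df.shape)   # rows, columns (10 features + target)
print(df.dtypes)  # every column is numeric
print(df.head())

# Distinct values of the sex column (scaled, so not 0/1)
sex_values = np.unique(df["sex"])
print(sex_values)

# Sum of squares per feature column; expected to be close to 1
col_sq_sums = (df[diabetes.feature_names] ** 2).sum()
print(col_sq_sums)
```

If raw, unscaled values are needed, recent scikit-learn versions accept `load_diabetes(scaled=False)`; whether that parameter is available depends on the installed version.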
1.2 Inspect the dataset distribution
```python
import matplotlib.pyplot as plt
X, y = diabetes.data, diabetes.target
# Plot the distribution of each feature
fig, axes = plt.subplots(nrows=2, ncols=5, figsize=(16, 6))
for i in range(10):
    ax = axes[i // 5, i % 5]
    ax.hist(X[:, i], bins=30)
    ax.set_title(diabetes.feature_names[i])
plt.tight_layout()
plt.show()
# Plot the distribution of the target variable
plt.hist(y, bins=30)
plt.title('target variable distribution')
plt.show()
```
Output:
![diabetes_distribution](https://img-blog.csdnimg.cn/20210926174228798.png)
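The figure shows that the features have differing distributions, while the target variable is roughly bell-shaped (approximately normal) with some right skew.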
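A histogram gives only a visual impression, so a few summary statistics can back up the reading of the target's shape. The following is a minimal sketch, assuming SciPy is installed (scikit-learn depends on it); the variable names `target_mean`, `target_median`, and `target_skew` are introduced here for illustration.

```python
import numpy as np
from scipy.stats import skew
from sklearn.datasets import load_diabetes

y = load_diabetes().target

# Compare mean and median: a mean above the median hints at a right skew
target_mean = float(np.mean(y))
target_median = float(np.median(y))

# Sample skewness: 0 for a symmetric distribution, positive for a right tail
target_skew = float(skew(y))

print(f"mean={target_mean:.2f}, median={target_median:.2f}, skew={target_skew:.2f}")
```

A skewness close to 0 would support the "approximately normal" reading; a clearly positive value suggests a longer right tail.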
1.3 Missing-value detection
```python
print(np.isnan(X).any())
```
Output: False
This shows that the dataset contains no missing values.
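The single boolean above covers the whole feature matrix at once. A per-column count is often easier to act on, and the target can be checked as well. This is a small sketch using pandas; `missing_per_column` and `target_missing` are names assumed here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Count NaN values in each feature column
df = pd.DataFrame(X, columns=diabetes.feature_names)
missing_per_column = df.isna().sum()
print(missing_per_column)

# Count NaN values in the target vector
target_missing = int(np.isnan(y).sum())
print("missing in target:", target_missing)
```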
2. Split the dataset into training and test sets
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
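The dataset is split into a training set and a test set, with the test set holding 20% of the samples; `random_state=42` makes the split reproducible.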
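Since the stated goal is a ridge regression model, the split can be checked and then fed to `sklearn.linear_model.Ridge`. The sketch below is only one possible continuation: `alpha=1.0` is an assumed regularization strength, not a value taken from the original, and in practice it would be tuned (for example with cross-validation).

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Confirm the sizes of the two subsets
print(X_train.shape, X_test.shape)

# Fit a ridge regression model (alpha is an assumed, untuned value)
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE={mse:.2f}, R^2={r2:.3f}")
```

With 442 samples and `test_size=0.2`, the split yields 353 training rows and 89 test rows.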