Downloading the UCI Breast Cancer Dataset
Posted: 2023-09-19 12:05:42
The UCI breast cancer dataset can be downloaded from the UCI Machine Learning Repository. Download link:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
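The dataset contains 569 samples with 30 features each. The features derive from 10 measurements of the cell nuclei in a digitized image of a fine needle aspirate of a breast mass: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. For each measurement the dataset records the mean, the standard error, and the "worst" value (the mean of the three largest values), giving 3 × 10 = 30 features. Every sample is labeled benign (B) or malignant (M).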
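The grouping of the 30 features can be checked with a short sketch. It assumes the copy of the dataset bundled with scikit-learn (`load_breast_cancer`) rather than the raw UCI file, and the `groups` dictionary is only an illustrative name:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
names = list(data.feature_names)

# Group the 30 column names by the statistic encoded in each name.
groups = {
    "mean": [n for n in names if n.startswith("mean ")],
    "standard error": [n for n in names if n.endswith(" error")],
    "worst": [n for n in names if n.startswith("worst ")],
}

print(data.data.shape)
for stat, cols in groups.items():
    print(f"{stat}: {len(cols)} features, e.g. {cols[0]!r}")
```

Each group should contain the same 10 base measurements, once per statistic.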
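Related questions
Please write code that uses a transformer to perform binary classification on the UCI breast cancer dataset and prints the accuracy on the training, validation, and test sets.
Sure. The code below uses a transformer for binary classification on the UCI breast cancer dataset and prints the training, validation, and test accuracy: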
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# TransformerEncoder and TransformerEncoderLayer live in torch.nn,
# not in the Hugging Face transformers package.
from torch.nn import TransformerEncoder, TransformerEncoderLayer

# Load the dataset
data = load_breast_cancer()
x = data.data
y = data.target

# Split into training, validation, and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=42)

# Standardize the features (their scales differ by orders of magnitude),
# using statistics from the training set only
scaler = StandardScaler().fit(x_train)
x_train, x_val, x_test = (scaler.transform(a) for a in (x_train, x_val, x_test))


# Dataset class
class BreastCancerDataset(Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __getitem__(self, index):
        return (torch.tensor(self.x[index], dtype=torch.float32),
                torch.tensor(self.y[index], dtype=torch.long))

    def __len__(self):
        return len(self.x)


# Model class
class TransformerClassifier(nn.Module):
    def __init__(self, n_feat, n_class, nhead, nhid, nlayers, dropout):
        super().__init__()
        # d_model (here n_feat) must be divisible by nhead
        encoder_layer = TransformerEncoderLayer(
            d_model=n_feat, nhead=nhead, dim_feedforward=nhid,
            dropout=dropout, activation='relu', batch_first=True)
        self.transformer_encoder = TransformerEncoder(encoder_layer, num_layers=nlayers)
        self.fc = nn.Linear(n_feat, n_class)

    def forward(self, x):
        # Treat each sample as a length-1 sequence: (batch, n_feat) -> (batch, 1, n_feat)
        x = x.unsqueeze(1)
        x = self.transformer_encoder(x)
        # Average over the sequence dimension: (batch, 1, n_feat) -> (batch, n_feat)
        x = x.mean(dim=1)
        x = self.fc(x)
        return x


# Training function: one pass over the training data
def train(model, dataloader, criterion, optimizer):
    model.train()
    train_loss = 0
    train_acc = 0
    for x, y in dataloader:
        optimizer.zero_grad()
        output = model(x)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()
        # Weight the batch-mean loss by batch size so the epoch average is exact
        train_loss += loss.item() * len(x)
        train_acc += (output.argmax(dim=1) == y).sum().item()
    train_loss = train_loss / len(dataloader.dataset)
    train_acc = train_acc / len(dataloader.dataset)
    return train_loss, train_acc


# Validation function
def validate(model, dataloader, criterion):
    model.eval()
    val_loss = 0
    val_acc = 0
    with torch.no_grad():
        for x, y in dataloader:
            output = model(x)
            loss = criterion(output, y)
            val_loss += loss.item() * len(x)
            val_acc += (output.argmax(dim=1) == y).sum().item()
    val_loss = val_loss / len(dataloader.dataset)
    val_acc = val_acc / len(dataloader.dataset)
    return val_loss, val_acc


# Test function
def test(model, dataloader):
    model.eval()
    test_acc = 0
    with torch.no_grad():
        for x, y in dataloader:
            output = model(x)
            test_acc += (output.argmax(dim=1) == y).sum().item()
    test_acc = test_acc / len(dataloader.dataset)
    return test_acc


# Hyperparameters
n_feat = x.shape[1]
n_class = 2
nhead = 6  # must divide n_feat (30); the original value of 8 raises an error
nhid = 64
nlayers = 4
dropout = 0.1
lr = 0.001
batch_size = 64
epochs = 10

# Create the datasets and data loaders
train_dataset = BreastCancerDataset(x_train, y_train)
val_dataset = BreastCancerDataset(x_val, y_val)
test_dataset = BreastCancerDataset(x_test, y_test)
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Create the model, loss function, and optimizer
model = TransformerClassifier(n_feat, n_class, nhead, nhid, nlayers, dropout)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)

# Train and validate the model
for epoch in range(epochs):
    train_loss, train_acc = train(model, train_dataloader, criterion, optimizer)
    val_loss, val_acc = validate(model, val_dataloader, criterion)
    print(f'Epoch {epoch+1}: Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}, Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}')

# Test the model
test_acc = test(model, test_dataloader)
print(f'Test Acc: {test_acc:.4f}')
```
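Running the code prints the training loss, training accuracy, validation loss, and validation accuracy for each of the 10 epochs, followed by the test-set accuracy.
Code for using the UCI breast cancer dataset
Below is a Python example that uses the UCI breast cancer dataset: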
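Because the split is done in two stages, the second `test_size=0.2` applies to the remaining 80% of the data, so the resulting proportions are roughly 64% / 16% / 20% rather than 60% / 20% / 20%. A minimal sketch that repeats the same two `train_test_split` calls and reports the resulting sizes (the loop and variable `n_total` are added here for illustration only):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

x, y = load_breast_cancer(return_X_y=True)

# Same two-stage split as in the training script
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=42)

n_total = len(x)
for name, part in [("train", x_train), ("val", x_val), ("test", x_test)]:
    print(f"{name}: {len(part)} samples ({len(part) / n_total:.1%})")
```

With so few validation and test samples, a single misclassified sample moves the reported accuracy by about one percentage point, so small differences between runs should be read with caution.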
```python
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
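# Load the breast cancer dataset (scikit-learn's bundled copy of the UCI data)
data = load_breast_cancer()
# Convert the dataset to a pandas DataFrame
df = pd.DataFrame(np.c_[data['data'], data['target']],
                  columns=np.append(data['feature_names'], ['target']))
# Print a summary of the dataset
print(df.info())
# Print the first 5 rows
print(df.head())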
```
The output is as follows:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
10   radius error             569 non-null    float64
11   texture error            569 non-null    float64
12   perimeter error          569 non-null    float64
13   area error               569 non-null    float64
14   smoothness error         569 non-null    float64
15   compactness error        569 non-null    float64
16   concavity error          569 non-null    float64
17   concave points error     569 non-null    float64
18   symmetry error           569 non-null    float64
19   fractal dimension error  569 non-null    float64
20   worst radius             569 non-null    float64
21   worst texture            569 non-null    float64
22   worst perimeter          569 non-null    float64
23   worst area               569 non-null    float64
24   worst smoothness         569 non-null    float64
25   worst compactness        569 non-null    float64
26   worst concavity          569 non-null    float64
27   worst concave points     569 non-null    float64
28   worst symmetry           569 non-null    float64
29   worst fractal dimension  569 non-null    float64
30   target                   569 non-null    float64
dtypes: float64(31)
memory usage: 137.9 KB
None
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension radius error texture error perimeter error area error smoothness error compac...
0 17.99 10.38 122.80 1001.0 0.1184 0.2776 0.3001 0.14710 0.2419 0.07871 1.0950 0.9053 8.589 153.40000 0.006399 ...
1 20.57 17.77 132.90 1326.0 0.0847 0.0786 0.0869 0.07017 0.1812 0.05667 0.5435 0.7339 3.398 74.08000 0.005225 ...
2 19.69 21.25 130.00 1203.0 0.1096 0.1599 0.1974 0.12790 0.2069 0.05999 0.7456 0.7869 4.585 94.03000 0.006150 ...
3 11.42 20.38 77.58 386.1 0.1425 0.2839 0.2414 0.10520 0.2597 0.09744 0.4956 1.1560 3.445 27.23000 0.009110 ...
4 20.29 14.34 135.10 1297.0 0.1003 0.1328 0.1980 0.10430 0.1809 0.05883 0.7572 0.7813 5.438 94.44000 0.011490 ...
```
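Here the load_breast_cancer function from sklearn loads the breast cancer dataset (scikit-learn ships a copy, so no separate download is needed), and the data is then converted to a pandas DataFrame for easier analysis and processing. Because the target column is stacked together with the float features, it is stored as float64 as well.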
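One detail worth checking is how the numeric target maps to the B/M labels of the original UCI file. In the scikit-learn copy, `target_names` is `['malignant', 'benign']`, so 0 means malignant and 1 means benign. A minimal sketch that makes the mapping explicit and counts the classes (the names `labels` and `counts` are illustrative):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names is ['malignant', 'benign'], so 0 = malignant and 1 = benign
labels = pd.Series(data.target).map(dict(enumerate(data.target_names)))

# Number of samples per class
counts = labels.value_counts()
print(counts)
```

The classes are moderately imbalanced (357 benign, 212 malignant), which is worth keeping in mind when reading plain accuracy figures.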