Machine learning: data processing code for the breast cancer dataset (Bayes)
Posted: 2024-07-08 21:00:53
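When applying machine learning to the Breast Cancer Wisconsin dataset (for example, before fitting a Bayes classifier), the workflow usually goes through the following preprocessing and modeling steps: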
1. **Load and explore the data**:
```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
print(df.head())
```
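Beyond `df.head()`, it usually helps to look at the dataset's size and class balance before going further. The following is a minimal sketch of such a check; the variable name `class_counts` is introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Number of rows and columns (30 features plus the target column)
print(df.shape)

# Map the integer labels (0, 1) to their class names before counting
label_names = {i: str(name) for i, name in enumerate(data.target_names)}
class_counts = df['target'].map(label_names).value_counts()
print(class_counts)

# Summary statistics for the first five features
print(df.iloc[:, :5].describe())
```

In the copy bundled with scikit-learn, label 0 means malignant and label 1 means benign, which matters later when reading a confusion matrix.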
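2. **Clean the data**:
Check for missing values or outliers and handle them as needed. The copy of this dataset bundled with scikit-learn contains no missing values, so the check below acts as a safeguard.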
```python
if df.isnull().sum().sum() > 0:
    # Fill missing values with the column means (or drop the affected rows with df.dropna())
    df.fillna(df.mean(), inplace=True)
```
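The step above mentions outliers, but the snippet only handles missing values. One common way to flag outliers is the 1.5 × IQR rule. The following is a sketch under that assumption; the threshold and the names `outlier_mask` and `outliers_per_column` are illustrative choices, not part of the original code.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
features = pd.DataFrame(data.data, columns=data.feature_names)

# Count missing values per column
missing_per_column = features.isnull().sum()
print("Total missing values:", int(missing_per_column.sum()))

# IQR rule: flag values more than 1.5 * IQR below Q1 or above Q3
q1 = features.quantile(0.25)
q3 = features.quantile(0.75)
iqr = q3 - q1
outlier_mask = (features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)

# Number of flagged values in each column, largest first
outliers_per_column = outlier_mask.sum().sort_values(ascending=False)
print(outliers_per_column.head())
```

Whether flagged values should be removed, clipped, or kept depends on the task. In medical measurements an extreme value may be a genuine signal rather than an error, so it is worth inspecting them before dropping anything.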
3. **Scale the features**:
Numeric features are usually normalized or standardized so that all of them are on the same scale.
```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
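# Standardize all columns except 'target'. Caveat: fitting on the full dataset leaks
# test-set statistics; for a strict evaluation, fit on the training split only.
df[df.columns[:-1]] = scaler.fit_transform(df[df.columns[:-1]])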
```
4. **Split the dataset**:
Split the data into a training set and a test set.
```python
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
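If the scaler is fitted before the split, the test rows influence the mean and variance used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The following is a minimal sketch of that ordering; `stratify=y` is an added assumption that keeps the class ratio the same in both splits.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Split first so the test rows never influence the scaling statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training rows
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.shape, X_test_scaled.shape)
```

After this, the training features have zero mean and unit variance per column, while the test features are only approximately standardized, which is expected.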
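5. **Feature selection/encoding**:
If necessary, encode categorical features with one-hot encoding or another encoding method. All 30 features in this dataset are numeric, so there is nothing to encode here.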
```python
# Collect the string-typed (categorical) columns; the list is empty for this dataset
categorical_features = [col for col in df.columns if df[col].dtype == 'object']
# pd.get_dummies one-hot encodes the listed columns
encoded_data = pd.get_dummies(df, columns=categorical_features)
```
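6. **Build and train the model**:
Choose a suitable machine learning algorithm, such as logistic regression, a random forest, or a support vector machine, and train the model.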
```python
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42)  # fixed seed for reproducible results
model.fit(X_train, y_train)
```
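Since the title mentions Bayes, a naive Bayes classifier is a natural alternative to the random forest. The following is a sketch assuming that "Bayes" refers to Gaussian naive Bayes (`GaussianNB` in scikit-learn); the original text does not state which Bayesian method was intended.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Gaussian naive Bayes models each feature as normally distributed within each class
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# Accuracy on the held-out test set
nb_accuracy = nb_model.score(X_test, y_test)
print(f"GaussianNB test accuracy: {nb_accuracy:.3f}")

# Posterior class probabilities for the first three test samples
probs = nb_model.predict_proba(X_test[:3])
print(probs)
```

Gaussian naive Bayes estimates a separate mean and variance per feature and class, so standardizing the features beforehand is not required for it.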
7. **Evaluate the model**:
Evaluate the model's performance on the test set.
```python
from sklearn.metrics import accuracy_score, confusion_matrix
y_pred = model.predict(X_test)
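accuracy = accuracy_score(y_test, y_pred)
conf_mat = confusion_matrix(y_test, y_pred)  # rows: true classes, columns: predicted classes
print(f"Accuracy: {accuracy:.3f}")
print(conf_mat)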
```
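Accuracy alone can hide how the model performs on each class, and a single train/test split can be lucky or unlucky. A sketch of two complementary checks follows: a per-class report and 5-fold cross-validation. The fold count and the `random_state` values are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Precision, recall, and F1 for each class (0 = malignant, 1 = benign)
report = classification_report(
    y_test, model.predict(X_test), target_names=data.target_names
)
print(report)

# 5-fold cross-validation on the full dataset gives a less split-dependent estimate
cv_scores = cross_val_score(
    RandomForestClassifier(random_state=42), data.data, data.target, cv=5
)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

For a diagnostic task, recall on the malignant class is often the figure to watch, since a missed malignant case is typically costlier than a false alarm.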