Using a dataset such as Spambase from the UCI Machine Learning Repository, write Python code that predicts whether an email is spam. Provide detailed and correct Python code.
Posted: 2024-09-15 16:13:47
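Sure. Below is a simple example that uses Python's scikit-learn library to predict whether an email is spam, based on the Spambase dataset from the UCI Machine Learning Repository. The code reads `spambase.data` directly from the UCI URL; if you have downloaded the file instead (for example as 'spambase_dataset.csv'), pass that local path to `pd.read_csv`. The code covers data loading, preprocessing, and building and training a naive Bayes classifier: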
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
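# 1. Load the data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"  # dataset URL
# The file has no header row: 57 numeric feature columns followed by a 0/1 label column.
# Placeholder names are used here; the real names ('word_freq_make', 'word_freq_address', ...)
# are listed in the dataset's spambase.names file.
column_names = [f"feature_{i}" for i in range(57)] + ['label']
data = pd.read_csv(url, header=None, names=column_names)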
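# 2. Preprocessing
# The label column in spambase.data is already numeric (0 = not spam, 1 = spam),
# so no 'ham'/'spam' mapping is needed; mapping it would turn every label into NaN.
X = data.drop('label', axis=1)
y = data['label']
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features, fitting the scaler on the training set only so that
# no test-set statistics leak into training. GaussianNB estimates a per-feature
# mean and variance, so scaling has little effect on it; the step is optional.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)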
# 3. Build and train the model
gnb = GaussianNB()  # Gaussian naive Bayes classifier
gnb.fit(X_train, y_train)
# 4. Make predictions
y_pred = gnb.predict(X_test)
# 5. Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
conf_mat = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)
print("Confusion Matrix:")
print(conf_mat)
```
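To make the confusion matrix printed by the code easier to read, the following sketch unpacks its four cells and derives spam precision, spam recall, and the false-positive rate (legitimate mail flagged as spam). The labels are a hand-written toy example, not Spambase results, and the variable names are chosen only for illustration; 1 means spam and 0 means not spam, as in the dataset.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only; 1 = spam, 0 = not spam
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# With labels ordered [0, 1], ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

precision = tp / (tp + fp)            # share of flagged mail that really is spam
recall = tp / (tp + fn)               # share of spam that was caught
false_positive_rate = fp / (fp + tn)  # share of legitimate mail wrongly flagged

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} FPR={false_positive_rate:.2f}")
```

For a spam filter, the false-positive rate is often the figure to watch, since a legitimate email sent to the spam folder usually costs more than a spam message that slips through.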
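A single train/test split gives one accuracy figure that depends on the random seed. One possible way to get a steadier estimate is to wrap the scaler and classifier in a scikit-learn pipeline and run k-fold cross-validation, so that the scaler is refit on each training fold. The sketch below uses synthetic data from `make_classification` as a stand-in (57 features, matching Spambase's feature count) so that it runs offline. The sample count, fold count, and seed are arbitrary assumptions, and the scores say nothing about Spambase itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 500 samples, 57 features, binary target
X_demo, y_demo = make_classification(
    n_samples=500, n_features=57, n_informative=10, random_state=42
)

# The pipeline refits StandardScaler inside every training fold,
# so no validation-fold statistics leak into training
model = make_pipeline(StandardScaler(), GaussianNB())

# 5-fold cross-validated accuracy
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
print("Fold accuracies:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

To try this on the real dataset, pass the unscaled `X` and `y` from the main script in place of `X_demo` and `y_demo`.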