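Naive Bayes and KNN classification of the Pima Indians diabetes data in R: plotting training error against cross-validation error and computing prediction accuracy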
Date: 2024-06-09 15:09:46
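First, we need to prepare the dataset. Here we use the Pima Indians Diabetes dataset from the UCI Machine Learning Repository. It has eight numeric features and a binary label (0 or 1) in the ninth column.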
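```R
# Load the dataset. The file has no header row, so header = FALSE is needed;
# otherwise the first observation would be consumed as column names.
# Note: this URL may no longer be served; if so, point read.csv() at a local copy.
data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data",
                 header = FALSE)
# Split the data into a training set (70%) and a test set (30%)
set.seed(123)
trainIndex <- sample(1:nrow(data), floor(0.7 * nrow(data)))
train <- data[trainIndex, ]
test <- data[-trainIndex, ]
```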
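Next, we preprocess the data: separate the features from the label and standardize the features. The test set should be standardized with the means and standard deviations of the training set, so that both sets go through the same transformation.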
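```R
# Separate features (columns 1-8) and label (column 9)
train_x <- train[, -9]
train_y <- train[, 9]
test_x <- test[, -9]
test_y <- test[, 9]
# Standardize the features, reusing the training-set mean and SD for the test set
train_x <- scale(train_x)
test_x <- scale(test_x,
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))
```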
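Why the test set is scaled with the training statistics can be seen on a toy example. The sketch below uses made-up numbers, and the names `toy_train` and `toy_test` are purely illustrative; it only shows that scaling the test set on its own gives different values from reusing the training mean and standard deviation.

```R
# Toy data: hypothetical values, only to illustrate the scaling step
toy_train <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2,
                    dimnames = list(NULL, c("a", "b")))
toy_test  <- matrix(c(2, 3, 20, 30), ncol = 2,
                    dimnames = list(NULL, c("a", "b")))

# Standardize the training set; scale() stores the column means and SDs as attributes
toy_train_s <- scale(toy_train)
mu    <- attr(toy_train_s, "scaled:center")
sigma <- attr(toy_train_s, "scaled:scale")

# Apply the training-set transformation to the test set
toy_test_s <- scale(toy_test, center = mu, scale = sigma)

# For comparison: scaling the test set with its own mean and SD
toy_test_own <- scale(toy_test)

print(toy_test_s)
print(toy_test_own)
```

With only two test rows, the self-scaled version always maps each column to the same pair of values regardless of the data, which is why it cannot be compared with the training features.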
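Then we fit a KNN model with the `class` package (which provides `knn` and `knn.cv`) and a naive Bayes model with the `naivebayes` package, and compute the training error and cross-validation error of each.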
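```R
library(class)       # provides knn() and knn.cv(); there is no package named "knn"
library(naivebayes)
# Class labels should be a factor for both models
train_y <- factor(train_y)
# KNN model: predict the test set with k = 5
knn_model <- knn(train_x, test_x, train_y, k = 5)
# Training error
train_error_knn <- mean(train_y != knn(train_x, train_x, train_y, k = 5))
# Leave-one-out cross-validation error (knn.cv() returns the predicted labels)
cv_error_knn <- mean(train_y != knn.cv(train_x, train_y, k = 5))
# Naive Bayes model
nb_model <- naive_bayes(train_x, train_y)
# Training error
train_error_nb <- mean(train_y != predict(nb_model, train_x))
# 10-fold cross-validation error, computed by hand
# (the naivebayes package does not provide a cv_performance() function)
folds <- sample(rep(1:10, length.out = nrow(train_x)))
cv_error_nb <- mean(sapply(1:10, function(i) {
  fit <- naive_bayes(train_x[folds != i, ], train_y[folds != i])
  mean(train_y[folds == i] != predict(fit, train_x[folds == i, ]))
}))
# Print the results
cat("KNN training error:", train_error_knn, "\n")
cat("KNN cross-validation error:", cv_error_knn, "\n")
cat("Naive Bayes training error:", train_error_nb, "\n")
cat("Naive Bayes cross-validation error:", cv_error_nb, "\n")
```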
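The code above fixes k at 5. A common way to examine training error against cross-validation error for KNN is to repeat the calculation over a range of k. The following is a minimal sketch of that idea, not part of the original answer: it runs on synthetic two-class data (an assumption, so that it is self-contained), and the names `x`, `y`, and `ks` are illustrative. To apply it to the diabetes data, one would substitute the scaled training features and the factor label.

```R
library(class)  # provides knn() and knn.cv()

set.seed(1)
# Synthetic two-class data standing in for the scaled training features (an assumption)
n <- 200
x <- matrix(rnorm(n * 2), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.8) > 0, 1, 0))

ks <- seq(1, 25, by = 2)
# Training error: predict the training points with the model built on them
train_err <- sapply(ks, function(k) mean(knn(x, x, y, k = k) != y))
# Leave-one-out cross-validation error from knn.cv()
cv_err <- sapply(ks, function(k) mean(knn.cv(x, y, k = k) != y))

# Plot both curves against k with base graphics
matplot(ks, cbind(train_err, cv_err), type = "b", pch = 1:2, lty = 1:2, col = 1:2,
        xlab = "k", ylab = "Error rate")
legend("bottomright", legend = c("Training", "Leave-one-out CV"),
       pch = 1:2, lty = 1:2, col = 1:2)
```

At k = 1 the training error is zero because every point is its own nearest neighbour, so the training curve alone is a poor guide for choosing k; the cross-validation curve is the one to read.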
Next, we use a confusion matrix to compute the prediction accuracy on the test set.
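```R
# Compute prediction accuracy
library(caret)
# confusionMatrix() requires both arguments to be factors with the same levels
test_y_f <- factor(test_y, levels = c(0, 1))
confusionMatrix(knn_model, test_y_f)$overall['Accuracy']
confusionMatrix(predict(nb_model, test_x), test_y_f)$overall['Accuracy']
```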
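The accuracy reported by a confusion matrix is the share of predictions on its diagonal. As a small illustration in base R, with hypothetical labels made up for this example (`truth` and `pred` are not from the dataset):

```R
# Hypothetical true labels and predictions, made up for illustration
truth <- factor(c(0, 0, 1, 1, 1, 0, 1, 0), levels = c(0, 1))
pred  <- factor(c(0, 1, 1, 1, 0, 0, 1, 0), levels = c(0, 1))

# Confusion matrix: rows are predictions, columns are true labels
cm <- table(Predicted = pred, Actual = truth)
print(cm)

# Accuracy = correctly classified observations / all observations
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)  # 0.75 (6 of the 8 predictions match)
```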
We can also plot the training error and cross-validation error side by side to compare the two models more easily.
```R
library(ggplot2)
library(reshape2)  # provides melt()
# Collect the training and cross-validation errors
error_data <- data.frame(
  model = c("KNN", "Naive Bayes"),
  train_error = c(train_error_knn, train_error_nb),
  cv_error = c(cv_error_knn, cv_error_nb)
)
# Reshape to long format: one row per (model, error type)
error_data <- melt(error_data, id.vars = "model", variable.name = "error_type", value.name = "error")
# Grouped bar chart of the two error types for each model
ggplot(error_data, aes(x = model, y = error, fill = error_type)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(legend.position = "top") +
  labs(title = "Training error and cross-validation error", x = "Model", y = "Error rate")
```
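This completes the naive Bayes and KNN classification of the Pima Indians diabetes dataset, including the training error, cross-validation error, and prediction accuracy of each model.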