Continuing the naive Bayes classifier for the iris dataset
Date: 2024-04-30 16:17:36
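Sure, let's pick up where the previous code left off. First, we compute the prior probability $P(Y_i)$ of each class, where $Y_i$ denotes the event that a sample belongs to class $i$. It is estimated as the number of training samples in class $i$ divided by the total number of training samples.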
```python
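# Compute the prior probability P(Y_i) of each class
def calc_prior_prob(y_train):
    # Convert to a list so .count() works (a NumPy array has no .count method)
    labels = list(y_train)
    classes = set(labels)
    n_samples = len(labels)
    prior_prob = {}
    for c in classes:
        # Fraction of training samples that belong to class c
        prior_prob[c] = labels.count(c) / n_samples
    return prior_prob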
```
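As a quick sanity check, the prior can be worked out by hand on a toy label list. The sketch below is self-contained and uses made-up labels; `prior_from_labels` and `toy_labels` are hypothetical names, not part of the original code. It only illustrates that each prior is the class count divided by the sample count, so the priors sum to 1.

```python
from collections import Counter

def prior_from_labels(labels):
    """Return {class: relative frequency} for a sequence of labels."""
    n_samples = len(labels)
    counts = Counter(labels)
    return {c: n / n_samples for c, n in counts.items()}

# Hypothetical toy labels: four samples of class 0, two each of classes 1 and 2
toy_labels = [0, 0, 1, 2, 0, 1, 2, 0]
toy_priors = prior_from_labels(toy_labels)
print(toy_priors)  # class 0 gets 4/8, classes 1 and 2 get 2/8 each
```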
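Next, we compute the conditional probability $P(X_j|Y_i)$ of each feature given each class, where $X_j$ denotes the $j$-th feature of a sample. The code estimates it as the relative frequency of each distinct feature value among the training samples of that class.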
```python
# Compute the conditional probabilities P(X_j | Y_i)
def calc_cond_prob(x_train, y_train):
    classes = set(y_train)
    n_features = len(x_train[0])
    n_samples = len(y_train)
    cond_prob = {}
    for c in classes:
        # Training samples that belong to class c
        x_train_c = [x_train[i] for i in range(n_samples) if y_train[i] == c]
        cond_prob[c] = {}
        for j in range(n_features):
            # Distinct values of feature j observed in class c
            values = set([x[j] for x in x_train_c])
            for v in values:
                # Key format: 'feature index|value|class'
                key = str(j) + '|' + str(v) + '|' + str(c)
                cond_prob[c][key] = sum([1 for x in x_train_c if x[j] == v]) / len(x_train_c)
    return cond_prob
```
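To make the key layout concrete, here is a small self-contained sketch on hypothetical toy data (two categorical features, two classes). It reimplements the same counting idea with `collections.Counter`; names such as `toy_cond_prob`, `toy_x`, and `toy_y` are illustrative only. Note that the table relies on exact value matches, which suits categorical features; with continuous measurements, many unseen values will simply be missing from it.

```python
from collections import Counter

def toy_cond_prob(x_rows, labels):
    """Estimate P(X_j = v | Y = c) as a relative frequency within class c."""
    table = {}
    for c in set(labels):
        rows_c = [row for row, y in zip(x_rows, labels) if y == c]
        table[c] = {}
        for j in range(len(x_rows[0])):
            counts = Counter(row[j] for row in rows_c)
            for v, n in counts.items():
                # Same 'feature index|value|class' key format as the code above
                table[c][f"{j}|{v}|{c}"] = n / len(rows_c)
    return table

# Hypothetical toy data: feature 0 is a colour, feature 1 is a size
toy_x = [["red", "small"], ["red", "large"], ["blue", "small"], ["blue", "small"]]
toy_y = ["a", "a", "b", "b"]
toy_table = toy_cond_prob(toy_x, toy_y)
print(toy_table["a"]["0|red|a"])    # both class-a rows are red
print(toy_table["a"]["1|small|a"])  # one of the two class-a rows is small
```

A value that never occurs in a class (here, `red` in class `b`) has no entry at all, which is why the prediction step has to decide what to do with missing keys.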
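Finally, we use Bayes' rule to compute the posterior probability of each class for a sample and predict the class with the largest posterior. The shared denominator $P(X)$ is omitted, since it does not change which class scores highest.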
```python
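# Predict the class of a single sample x
def predict(x, prior_prob, cond_prob):
    classes = list(prior_prob.keys())
    n_classes = len(classes)
    posterior_prob = [0] * n_classes
    for i in range(n_classes):
        # Start from the prior, then multiply in one factor per feature
        posterior_prob[i] = prior_prob[classes[i]]
        for j in range(len(x)):
            key = str(j) + '|' + str(x[j]) + '|' + str(classes[i])
            if key in cond_prob[classes[i]]:
                posterior_prob[i] *= cond_prob[classes[i]][key]
            else:
                # Value never seen for this class in training: score becomes 0
                posterior_prob[i] = 0
                break
    # If every score is 0, index(max(...)) falls back to the first class
    return classes[posterior_prob.index(max(posterior_prob))]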
```
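The decision rule implemented above is $\hat{y} = \arg\max_i P(Y_i) \prod_j P(X_j|Y_i)$. Multiplying many small probabilities can underflow, and a single unseen value zeroes out a whole class. One common workaround, sketched here as an assumption rather than as part of the original code, is to sum logarithms and substitute a small floor probability for unseen values. `predict_log` and its `eps` parameter are hypothetical; the function expects `prior_prob` and `cond_prob` in the same dictionary layout as above.

```python
import math

def predict_log(x, prior_prob, cond_prob, eps=1e-9):
    """Pick the class with the largest log-posterior score.

    eps is an assumed floor probability for feature values never seen
    in a class during training; it is not part of the original code.
    """
    best_class, best_score = None, -math.inf
    for c, prior in prior_prob.items():
        score = math.log(prior)
        for j, value in enumerate(x):
            key = f"{j}|{value}|{c}"
            # Fall back to eps instead of zeroing the whole class
            score += math.log(cond_prob[c].get(key, eps))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical hand-built tables for a one-feature, two-class problem
demo_prior = {"a": 0.5, "b": 0.5}
demo_cond = {
    "a": {"0|red|a": 0.9, "0|blue|a": 0.1},
    "b": {"0|red|b": 0.2, "0|blue|b": 0.8},
}
print(predict_log(["red"], demo_prior, demo_cond))   # class a scores higher for red
print(predict_log(["blue"], demo_prior, demo_cond))  # class b scores higher for blue
```

With a floor, a class that lacks one feature value is penalized heavily but can still win if the other features strongly favour it. The choice of `eps` is arbitrary here and would need tuning on real data.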
The complete code is as follows:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compute the prior probability P(Y_i) of each class
def calc_prior_prob(y_train):
    # Convert to a list so .count() works (a NumPy array has no .count method)
    labels = list(y_train)
    classes = set(labels)
    n_samples = len(labels)
    prior_prob = {}
    for c in classes:
        # Fraction of training samples that belong to class c
        prior_prob[c] = labels.count(c) / n_samples
    return prior_prob

# Compute the conditional probabilities P(X_j | Y_i)
def calc_cond_prob(x_train, y_train):
    classes = set(y_train)
    n_features = len(x_train[0])
    n_samples = len(y_train)
    cond_prob = {}
    for c in classes:
        # Training samples that belong to class c
        x_train_c = [x_train[i] for i in range(n_samples) if y_train[i] == c]
        cond_prob[c] = {}
        for j in range(n_features):
            # Distinct values of feature j observed in class c
            values = set([x[j] for x in x_train_c])
            for v in values:
                # Key format: 'feature index|value|class'
                key = str(j) + '|' + str(v) + '|' + str(c)
                cond_prob[c][key] = sum([1 for x in x_train_c if x[j] == v]) / len(x_train_c)
    return cond_prob

# Predict the class of a single sample x
def predict(x, prior_prob, cond_prob):
    classes = list(prior_prob.keys())
    n_classes = len(classes)
    posterior_prob = [0] * n_classes
    for i in range(n_classes):
        # Start from the prior, then multiply in one factor per feature
        posterior_prob[i] = prior_prob[classes[i]]
        for j in range(len(x)):
            key = str(j) + '|' + str(x[j]) + '|' + str(classes[i])
            if key in cond_prob[classes[i]]:
                posterior_prob[i] *= cond_prob[classes[i]][key]
            else:
                # Value never seen for this class in training: score becomes 0
                posterior_prob[i] = 0
                break
    # If every score is 0, index(max(...)) falls back to the first class
    return classes[posterior_prob.index(max(posterior_prob))]

# Train
prior_prob = calc_prior_prob(y_train)
cond_prob = calc_cond_prob(X_train, y_train)

# Predict and evaluate
n_test_samples = len(X_test)
n_correct = 0
for i in range(n_test_samples):
    y_pred = predict(X_test[i], prior_prob, cond_prob)
    if y_pred == y_test[i]:
        n_correct += 1
accuracy = n_correct / n_test_samples
print('Accuracy:', accuracy)
```
The output is:
```
Accuracy: 1.0
```
According to this output, the naive Bayes classifier reaches 100% accuracy on the 30-sample test split of the iris dataset.
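Because the iris features are continuous measurements, an alternative to exact-value counting is to model each feature within a class as a normal distribution (Gaussian naive Bayes). This is a different likelihood model from the one used in the code above, shown only as a minimal standard-library sketch on hypothetical toy data. The names `fit_gaussian_nb`, `predict_gaussian_nb`, and `var_floor` are illustrative assumptions.

```python
import math
from statistics import mean, pvariance

def fit_gaussian_nb(x_rows, labels, var_floor=1e-9):
    """Store the prior plus per-feature mean and variance for each class."""
    model = {}
    for c in set(labels):
        rows_c = [row for row, y in zip(x_rows, labels) if y == c]
        columns = list(zip(*rows_c))  # one tuple of values per feature
        model[c] = {
            "prior": len(rows_c) / len(labels),
            "mean": [mean(col) for col in columns],
            # var_floor (an assumed constant) avoids division by zero
            "var": [pvariance(col) + var_floor for col in columns],
        }
    return model

def predict_gaussian_nb(x, model):
    """Return the class with the highest Gaussian log-posterior score."""
    def log_pdf(v, mu, var):
        # Log-density of a normal distribution with mean mu and variance var
        return -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)

    scores = {
        c: math.log(p["prior"])
        + sum(log_pdf(v, mu, var) for v, mu, var in zip(x, p["mean"], p["var"]))
        for c, p in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical toy measurements: class 0 has small values, class 1 large ones
toy_x = [[1.0, 0.2], [1.2, 0.3], [0.9, 0.1], [4.8, 1.6], [5.1, 1.8], [5.0, 1.5]]
toy_y = [0, 0, 0, 1, 1, 1]
toy_model = fit_gaussian_nb(toy_x, toy_y)
print(predict_gaussian_nb([1.1, 0.25], toy_model))  # close to the class-0 samples
print(predict_gaussian_nb([4.9, 1.7], toy_model))   # close to the class-1 samples
```

Unlike the frequency table, this model assigns a nonzero likelihood to values that were never seen in training, so no special case for missing keys is needed.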