Exercise: Building a Naive Bayes Classifier
Date: 2025-01-06 20:28:19
### Building a Naive Bayes Classifier
#### Computing Prior and Conditional Probabilities
To build a naive Bayes classifier with Laplace correction, the first task is to compute the prior probability of each class and the probability of each feature value given a class. For every possible class \( c \), the prior \( P(c) \) is estimated as the fraction of training samples that belong to that class[^2].
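As a tiny worked illustration of the prior computation (the labels below are made up for the example):

```python
from collections import Counter

# Made-up labels for illustration: 6 training samples, two classes.
y_train = ['yes', 'yes', 'yes', 'yes', 'no', 'no']

counts = Counter(y_train)
n = len(y_train)
# Prior P(c) = (number of samples in class c) / (total samples).
priors = {c: cnt / n for c, cnt in counts.items()}
print(priors)  # 'yes' -> 4/6, 'no' -> 2/6
```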
Next, for each discrete feature \( X_i \) within each class, count how often each value occurs, add a smoothing term of 1, and divide by the number of samples in that class plus the number of possible values the feature can take; this yields the conditional probability \( P(X_i \mid c) \)[^4]. This correction prevents the zero-probability problem that arises when a test instance contains a feature value never observed for that class during training.
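The Laplace-corrected conditional probability can be illustrated on a single hypothetical feature column (the "outlook" values below are invented for the example):

```python
from collections import Counter

# Toy illustration: the "outlook" feature column for the 4 training
# samples that belong to some class c.
values_in_class = ['sunny', 'sunny', 'rain', 'sunny']
# All values this feature can take across the whole training set (N_i = 3).
possible_values = {'sunny', 'rain', 'overcast'}

counts = Counter(values_in_class)
n_c = len(values_in_class)   # samples in class c
n_i = len(possible_values)   # number of possible feature values

# Laplace-corrected conditional probability P(X_i = v | c).
cond_prob = {v: (counts[v] + 1) / (n_c + n_i) for v in possible_values}

print(cond_prob['sunny'])     # (3 + 1) / (4 + 3)
print(cond_prob['overcast'])  # (0 + 1) / (4 + 3): nonzero despite never seen
```

Note that 'overcast' never appears in this class, yet it still receives a small nonzero probability, which is exactly the point of the correction.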
#### Python Implementation
Below is a simple Python implementation of the naive Bayes model with the Laplace-correction mechanism described above:
```python
import numpy as np
from collections import defaultdict, Counter


class NaiveBayesClassifier:
    def __init__(self):
        self.class_prior_ = {}    # P(c) per class
        self.cond_prob_ = {}      # P(X_i = v | c), keyed by ((class, feature), value)
        self.default_prob_ = {}   # fallback for values unseen during training

    def fit(self, X_train, y_train):
        n_samples = len(y_train)

        # Class priors with Laplace smoothing.
        classes, counts = np.unique(y_train, return_counts=True)
        for cls, cnt in zip(classes, counts):
            self.class_prior_[cls] = (cnt + 1) / float(n_samples + len(classes))

        # Collect the set of values each feature can take over the whole
        # training set; its size N_i is the smoothing denominator term.
        unique_vals_per_feature = defaultdict(set)
        for sample in X_train:
            for idx, feat_val in enumerate(sample):
                unique_vals_per_feature[idx].add(feat_val)

        # Count occurrences of each attribute value within each class.
        value_counts = defaultdict(Counter)
        for sample, target in zip(X_train, y_train):
            for index, attr_value in enumerate(sample):
                value_counts[(target, index)][attr_value] += 1

        # Apply the Laplace correction when computing conditionals.
        for (cls, idx), counter in value_counts.items():
            total_count = sum(counter.values())
            num_unique_vals = len(unique_vals_per_feature[idx])
            for val in unique_vals_per_feature[idx]:
                smoothed_p = (counter[val] + 1) / (total_count + num_unique_vals)
                self.cond_prob_[(cls, idx), val] = smoothed_p
            # Smoothed probability of a zero-count value, used as the
            # fallback for attribute values never seen during training.
            self.default_prob_[(cls, idx)] = 1.0 / (total_count + num_unique_vals)

    def predict(self, X_test):
        predictions = []
        for test_sample in X_test:
            scores = [(cls, self._compute_score(test_sample, cls))
                      for cls in self.class_prior_]
            max_cls, _ = max(scores, key=lambda item: item[1])
            predictions.append(max_cls)
        return predictions

    def _compute_score(self, instance, current_class):
        # Accumulate log-probabilities to avoid floating-point underflow.
        score = np.log(self.class_prior_[current_class])
        for pos, elem in enumerate(instance):
            key = ((current_class, pos), elem)
            if key in self.cond_prob_:
                prob = self.cond_prob_[key]
            else:
                prob = self.default_prob_[(current_class, pos)]
            score += np.log(prob)
        return score
```
This program defines a class named `NaiveBayesClassifier` that implements fitting (`fit`) and label prediction for an input dataset `X_train`, `y_train`. Conditional probability tables for each class are stored in dictionaries, and attribute values never observed during training automatically fall back to a small but nonzero smoothed probability[^1].
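To see why the scoring works in log space, here is a self-contained sketch with made-up smoothed probabilities for a single "outlook" feature (all numbers are illustrative, not derived from a real dataset):

```python
import math

# Hypothetical smoothed model: priors and P(outlook | class) for two classes.
class_prior = {'yes': 0.6, 'no': 0.4}
cond_prob = {
    ('yes', 0): {'sunny': 4/7, 'rain': 2/7, 'overcast': 1/7},
    ('no', 0):  {'sunny': 1/5, 'rain': 3/5, 'overcast': 1/5},
}

def score(instance, cls):
    # Summing logs instead of multiplying raw probabilities prevents
    # underflow when many features each contribute a tiny factor.
    s = math.log(class_prior[cls])
    for pos, val in enumerate(instance):
        s += math.log(cond_prob[(cls, pos)][val])
    return s

sample = ['rain']
pred = max(class_prior, key=lambda c: score(sample, c))
print(pred)  # prints "no": 0.4 * 3/5 beats 0.6 * 2/7
```

Because log is monotonic, taking the argmax of the log-scores yields the same class as the argmax of the raw posterior products.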