Naive Bayes Classification in Python
### Implementing Naive Bayes Classification with Python
#### Building a Naive Bayes Model
To build a simple Naive Bayes classifier, you can use the `MultinomialNB` class from the `sklearn` library, or write a custom implementation for a specific dataset. The example below demonstrates text classification.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Prepare the training documents and their class labels
training_data = ["Chinese Beijing Chinese", "Chinese Chinese Shanghai",
                 "Chinese Macao", "Tokyo Japan Chinese"]
labels = ['C', 'C', 'C', 'J']

# Convert the documents into a bag-of-words count matrix
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(training_data)

# Train a multinomial Naive Bayes model with Laplace smoothing (alpha=1.0)
classifier = MultinomialNB(alpha=1.0, fit_prior=True)
classifier.fit(X_train, labels)
```
This code shows how to prepare the training samples and build the feature matrix[^2]. A bag-of-words model converts the given document collection into a numeric representation suitable for the subsequent modeling step.
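To see what the vectorizer actually produced, you can inspect it directly (a minimal check, assuming scikit-learn 1.0+ for `get_feature_names_out`):
```python
# The vocabulary learned from the training documents (alphabetical order)
print(vectorizer.get_feature_names_out())

# Each row is one document; each column counts one vocabulary word
print(X_train.toarray())
```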
#### Testing and Evaluation
Once the preparation above is complete, you can classify a new instance:
```python
# Vectorize the unseen document with the same fitted vectorizer
test_data = ["Chinese Chinese Chinese Tokyo Japan"]
X_test = vectorizer.transform(test_data)

# Predict the class of the new document
predicted_label = classifier.predict(X_test)[0]
print(f"The predicted label is {predicted_label}")
```
This part predicts the class of an unseen sample and prints the result. Note that in real applications you should also consider techniques such as cross-validation and hyperparameter tuning to improve generalization[^3].
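As a sketch of what such tuning might look like (not part of the original example; it reuses `X_train` and `labels` from above, and the `alpha` candidates are arbitrary illustrative values):
```python
from sklearn.model_selection import GridSearchCV, KFold

# Candidate smoothing strengths to compare (illustrative values)
param_grid = {'alpha': [0.1, 0.5, 1.0, 2.0]}

# Plain 2-fold split; the toy dataset is too small for stratified CV,
# so the scores here only demonstrate the API, not meaningful accuracy
search = GridSearchCV(MultinomialNB(), param_grid, cv=KFold(n_splits=2))
search.fit(X_train, labels)
print(search.best_params_, search.best_score_)
```
On a realistically sized dataset you would use more folds (e.g. `cv=5`) and hold out a separate test set.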
#### Custom Implementation
To understand the internal mechanics more deeply, you can write a simple version from scratch:
```python
import numpy as np

def train_naive_bayes(train_set, classes):
    feature_probabilities = {}
    total_features_per_class = {}
    # Count how many documents belong to each class
    class_counts = {c: sum(1 for label in classes if label == c)
                    for c in set(classes)}
    num_of_docs = len(train_set)
    # Vocabulary: every distinct word across the training set
    unique_words = list(set(word for document in train_set
                            for word in document.split()))
    for current_class in set(classes):
        # Collect all words from documents of the current class
        words_in_this_class = []
        for index, text in enumerate(train_set):
            if classes[index] == current_class:
                words_in_this_class.extend(text.split())
        # Count occurrences of each vocabulary word in this class
        count_dict = {word: 0 for word in unique_words}
        for w in words_in_this_class:
            count_dict[w] += 1
        # Laplace (add-one) smoothing of the conditional probabilities
        smoothed_probs = {
            k: (v + 1) / (len(words_in_this_class) + len(unique_words))
            for k, v in count_dict.items()}
        feature_probabilities[current_class] = smoothed_probs
        total_features_per_class[current_class] = len(words_in_this_class)
    # Class priors: fraction of documents in each class
    prior_probability = {c: class_counts[c] / num_of_docs for c in set(classes)}
    return {'prior': prior_probability,
            'likelihoods': feature_probabilities,
            'total_feature_count_by_class': total_features_per_class}

def predict_single_doc(document, model):
    max_posterior = -float('inf')
    best_guess = None
    split_document = document.strip().split()
    for category in model['prior'].keys():
        # Sum of log-likelihoods; out-of-vocabulary words fall back
        # to a tiny floor probability instead of zero
        log_likelihood_sum = sum(
            np.log(model['likelihoods'][category].get(w, 1e-9))
            for w in split_document)
        # log P(class) + sum of log P(word | class)
        posterior = np.log(model['prior'][category]) + log_likelihood_sum
        if posterior > max_posterior:
            max_posterior = posterior
            best_guess = category
    return best_guess

train_dataset = ["I love programming", "programming makes me happy",
                 "happy coding everyone"]
categories = ['positive', 'neutral', 'positive']
trained_model = train_naive_bayes(train_dataset, categories)

new_sentence = "coding brings joy"
prediction = predict_single_doc(new_sentence, trained_model)
print(prediction)
```
This script shows how to learn the conditional probability distributions and make decisions without relying on third-party libraries. Note that smoothing ensures a word that never appears in a class's training data is still assigned a very small probability rather than being ruled out entirely[^4].
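To verify that behavior with the `trained_model` built above (a quick check; the two words are arbitrary picks from the toy vocabulary):
```python
# 'happy' occurs once among the 6 words of the 'positive' documents,
# and the vocabulary holds 8 distinct words
print(trained_model['likelihoods']['positive']['happy'])   # (1+1)/(6+8)

# 'makes' appears only in the 'neutral' document, yet still receives
# a small nonzero probability under the 'positive' class
print(trained_model['likelihoods']['positive']['makes'])   # (0+1)/(6+8)
```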