Naive Bayes text classification code: million-scale news text classification in practice (Naive Bayes, SVM)
Date: 2023-07-05 19:18:33
The following code performs text classification with Naive Bayes:
```python
import os
import random

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def load_data():
    """Load the dataset: each subdirectory of ./data is treated as one class label."""
    data = []
    labels = []
    root = './data'
    for label in os.listdir(root):
        label_dir = os.path.join(root, label)
        # Skip stray files that are not class directories.
        if not os.path.isdir(label_dir):
            continue
        for filename in os.listdir(label_dir):
            with open(os.path.join(label_dir, filename), 'r', encoding='utf-8') as f:
                content = f.read()
            data.append(content)
            labels.append(label)
    return data, labels


def preprocess_data(data):
    """Preprocess the data: segment each document into words with jieba."""
    preprocessed_data = []
    for doc in data:
        words = jieba.cut(doc)
        # Join tokens with spaces so the vectorizer can split them again.
        preprocessed_data.append(' '.join(words))
    return preprocessed_data


def split_data(data, labels, test_ratio=0.2):
    """Shuffle the dataset and split it into training and test sets."""
    data_labels = list(zip(data, labels))
    random.shuffle(data_labels)
    data, labels = zip(*data_labels)
    split_index = int(len(data) * (1 - test_ratio))
    train_data = data[:split_index]
    train_labels = labels[:split_index]
    test_data = data[split_index:]
    test_labels = labels[split_index:]
    return train_data, train_labels, test_data, test_labels


def train_model(train_data, train_labels):
    """Train the model: TF-IDF features with multinomial Naive Bayes."""
    vectorizer = TfidfVectorizer()
    train_data = vectorizer.fit_transform(train_data)
    model = MultinomialNB()
    model.fit(train_data, train_labels)
    return model, vectorizer


def evaluate_model(model, vectorizer, test_data, test_labels):
    """Evaluate the model: print accuracy on the test set."""
    # Reuse the vectorizer fitted on the training set; do not refit it here.
    test_data = vectorizer.transform(test_data)
    accuracy = model.score(test_data, test_labels)
    print('Accuracy:', accuracy)


if __name__ == '__main__':
    # Load the dataset
    data, labels = load_data()
    # Preprocess the data
    preprocessed_data = preprocess_data(data)
    # Split the dataset
    train_data, train_labels, test_data, test_labels = split_data(preprocessed_data, labels)
    # Train the model
    model, vectorizer = train_model(train_data, train_labels)
    # Evaluate the model
    evaluate_model(model, vectorizer, test_data, test_labels)
```
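The code above classifies news text with a Naive Bayes model. `load_data` loads the dataset, and `preprocess_data` segments each document into words with jieba. `split_data` then shuffles the data and splits it into a training set and a test set, `train_model` builds TF-IDF features and fits a multinomial Naive Bayes model on the training set, and `evaluate_model` reports the model's accuracy on the test set.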
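The title also mentions SVM, but the listing above only trains Naive Bayes. The following is a minimal sketch of how the two could be compared on the same TF-IDF features; it is not the author's method. The helper names `build_pipeline` and `compare_classifiers` are hypothetical, `LinearSVC` is assumed as the SVM variant, the toy documents are made up and already segmented, and the `token_pattern` override is an assumption added so that one-character tokens (common after Chinese word segmentation) are not dropped by the vectorizer's default pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_pipeline(classifier):
    """Chain TF-IDF features with a classifier so both are fitted together."""
    # Assumption: keep one-character tokens, which the default pattern drops.
    vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
    return make_pipeline(vectorizer, classifier)


def compare_classifiers(docs, labels, test_ratio=0.2, seed=42):
    """Fit Naive Bayes and a linear SVM on one stratified split; return accuracies."""
    train_docs, test_docs, train_labels, test_labels = train_test_split(
        docs, labels, test_size=test_ratio, stratify=labels, random_state=seed
    )
    candidates = {
        "naive_bayes": MultinomialNB(),
        "linear_svm": LinearSVC(),
    }
    scores = {}
    for name, classifier in candidates.items():
        pipeline = build_pipeline(classifier)
        pipeline.fit(train_docs, train_labels)
        scores[name] = pipeline.score(test_docs, test_labels)
    return scores


if __name__ == "__main__":
    # Toy, already-segmented documents, invented for illustration only.
    sports = [
        "球队 赢 了 比赛",
        "球员 进球 比赛 精彩",
        "教练 称赞 球队 表现",
        "球迷 为 球队 欢呼",
        "球员 训练 备战 比赛",
    ]
    finance = [
        "股市 上涨 投资者 乐观",
        "银行 下调 利率",
        "公司 发布 财报 利润 增长",
        "基金 买入 股票",
        "央行 调整 利率 政策",
    ]
    docs = sports + finance
    labels = ["sports"] * len(sports) + ["finance"] * len(finance)
    print(compare_classifiers(docs, labels))
```

Compared with the hand-written `split_data`, `train_test_split` with `stratify=labels` keeps the class proportions the same in both splits, and a fixed `random_state` makes the split repeatable. Wrapping the vectorizer and classifier in a pipeline ensures the vectorizer is only ever fitted on training documents. Accuracy on a corpus this small says nothing about either model; the sketch only shows the wiring.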