Implementing short-text classification with NB (Naive Bayes) in Python
Posted: 2023-06-25 20:02:09
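Sure. We can use the Naive Bayes algorithm for short-text classification. Naive Bayes is a probabilistic method: from the training data it estimates how likely each class is and how likely each word is within each class, then scores a new text under every class and assigns it to the class with the highest score. It is called "naive" because it treats the words of a text as independent of one another given the class.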
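For reference, the decision rule usually written for Naive Bayes text classification is shown below. Here $c$ ranges over the classes and $w_1, \dots, w_n$ are the words of the text; these symbols are notation introduced for this note, not names from the code. The second form takes logarithms, which is what implementations normally do to avoid multiplying many small numbers.

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
        = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right]
```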
Here is a simple example. Suppose we have some short texts and their labels:
```
data = [
('I love this sandwich.', 'positive'),
('This is an amazing place!', 'positive'),
('I feel very good about these beers.', 'positive'),
('This is my best work.', 'positive'),
("What an awesome view", 'positive'),
('I do not like this restaurant', 'negative'),
('I am tired of this stuff.', 'negative'),
("I can't deal with this", 'negative'),
('He is my sworn enemy!', 'negative'),
('My boss is horrible.', 'negative')
]
```
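We can split this data into a training set and a test set. With 10 samples and an 80/20 split, that gives 8 training samples and 2 test samples: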
```
import random
random.shuffle(data)
train_data = data[:int(len(data)*0.8)]
test_data = data[int(len(data)*0.8):]
```
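Because `random.shuffle` changes the list in place and gives a different order on every run, results are hard to reproduce. A minimal sketch of a repeatable split follows; the helper name `split_data`, the `seed` parameter, and the placeholder samples are assumptions made for illustration.

```python
import random

def split_data(samples, train_ratio=0.8, seed=42):
    """Return (train, test) lists without modifying the input list."""
    rng = random.Random(seed)      # private generator, global random state untouched
    shuffled = list(samples)       # copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Placeholder samples, only to show the split sizes.
samples = [("text %d" % i, "positive" if i % 2 else "negative") for i in range(10)]
train_part, test_part = split_data(samples)
print(len(train_part), len(test_part))  # 8 2
```

Calling `split_data` again with the same seed returns the same split, which makes it easier to compare changes to the classifier.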
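Next, we turn the training texts into features. We use the bag-of-words model: each word is a feature, and we count how many times each word occurs in each class. In the counts below, index 0 holds the negative count and index 1 holds the positive count.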
```
from collections import defaultdict

def get_word_counts(train_data):
    # Map each word to [count in negative texts, count in positive texts].
    word_counts = defaultdict(lambda: [0, 0])
    for text, label in train_data:
        # Lowercase and split on whitespace; punctuation stays attached to words.
        words = text.lower().split()
        for word in words:
            word_counts[word][0 if label == 'negative' else 1] += 1
    return word_counts

word_counts = get_word_counts(train_data)
```
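One thing to watch with `split()` is that punctuation stays glued to the word, so `sandwich.` and `sandwich` count as different features. A possible variation, sketched here with a hypothetical `tokenize` helper and a simple regular expression (an assumption, not part of the code above), keeps only letters and apostrophes:

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def count_words(samples):
    """Map each word to [negative count, positive count]."""
    counts = defaultdict(lambda: [0, 0])
    for text, label in samples:
        for word in tokenize(text):
            counts[word][0 if label == 'negative' else 1] += 1
    return counts

print("I love this sandwich.".lower().split())  # ['i', 'love', 'this', 'sandwich.']
print(tokenize("I love this sandwich."))        # ['i', 'love', 'this', 'sandwich']
```

This pattern only handles English letters; other languages or numbers would need a different tokenizer.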
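Next, we define a training function that turns these counts into per-class word probabilities. It uses add-one (Laplace) smoothing, so a word that never appeared in one class still gets a non-zero probability there.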
```
def train(train_data):
    word_counts = get_word_counts(train_data)
    # Total number of word occurrences in each class.
    negative_count = sum(count[0] for count in word_counts.values())
    positive_count = sum(count[1] for count in word_counts.values())
    # Add-one (Laplace) smoothing: adding the vocabulary size to the
    # denominator makes each class's word probabilities sum to 1.
    vocab_size = len(word_counts)
    negative_prob = {}
    positive_prob = {}
    for word, (negative, positive) in word_counts.items():
        negative_prob[word] = (negative + 1) / (negative_count + vocab_size)
        positive_prob[word] = (positive + 1) / (positive_count + vocab_size)
    return negative_prob, positive_prob

negative_prob, positive_prob = train(train_data)
```
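To see what add-one smoothing does to the numbers, here is a small worked example for a single class. The function name `smoothed_probs`, the `alpha` parameter, and the toy counts are made up for illustration.

```python
def smoothed_probs(counts, alpha=1):
    """counts: dict of word -> occurrences in one class.
    Returns dict of word -> smoothed P(word | class)."""
    total = sum(counts.values())     # all word occurrences in this class
    vocab_size = len(counts)         # number of distinct words
    return {w: (c + alpha) / (total + alpha * vocab_size)
            for w, c in counts.items()}

# Toy counts: 'bad' never occurs in this class.
toy_counts = {'good': 3, 'bad': 0, 'this': 1}
probs = smoothed_probs(toy_counts)
for word, p in probs.items():
    print(word, p)
```

With 4 occurrences and 3 distinct words the denominator is 7, so the probabilities are 4/7, 1/7, and 2/7. The unseen word `bad` gets 1/7 instead of 0, and the three values still add up to 1.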
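Finally, we define a prediction function that uses these probabilities to classify a short text. It adds up log probabilities instead of multiplying raw probabilities, which avoids numerical underflow.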
```
import math

def predict(text):
    words = text.lower().split()
    negative_score = 0.0
    positive_score = 0.0
    for word in words:
        # Skip words that never appeared in training: the model has no
        # evidence for them in either class.
        if word not in negative_prob:
            continue
        negative_score += math.log(negative_prob[word])
        positive_score += math.log(positive_prob[word])
    # Class priors are left out, which assumes both classes are about equally common.
    return 'negative' if negative_score > positive_score else 'positive'
```
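The prediction above compares word likelihoods only. If the classes are unbalanced, the class prior matters as well. The following is a minimal, self-contained sketch that includes the prior and supports any number of labels; the names `fit` and `classify`, the model dictionary layout, and the toy data are assumptions for illustration, not part of the code above.

```python
import math
from collections import Counter, defaultdict

def fit(samples):
    """samples: list of (text, label). Collects the counts the classifier needs."""
    label_counts = Counter()             # number of texts per label
    word_counts = defaultdict(Counter)   # label -> word -> occurrences
    vocab = set()
    for text, label in samples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return {'labels': label_counts, 'words': word_counts, 'vocab': vocab}

def classify(model, text):
    """Return the label with the highest log prior + log likelihood."""
    n_docs = sum(model['labels'].values())
    vocab_size = len(model['vocab'])
    best_label, best_score = None, -math.inf
    for label, n_label in model['labels'].items():
        total = sum(model['words'][label].values())
        score = math.log(n_label / n_docs)          # log prior
        for word in text.lower().split():
            if word not in model['vocab']:
                continue                             # ignore unseen words
            count = model['words'][label][word]      # 0 if absent in this class
            score += math.log((count + 1) / (total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

toy = [('good great', 'positive'), ('great fun', 'positive'), ('bad awful', 'negative')]
model = fit(toy)
print(classify(model, 'great'))      # positive
print(classify(model, 'awful bad'))  # negative
```

When a text contains only unseen words, the score reduces to the log prior, so the most frequent class in the training data is returned.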
Now we can run the model on the test set:
```
for text, label in test_data:
    pred = predict(text)
    # Print the text, its true label, and the predicted label.
    print(text, label, pred)
```
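Each output line shows the text, its true label, and the predicted label. Because the split is random, the lines differ from run to run, and with only two test samples each run prints two lines. With so little training data the predicted label will not always match the true one. The output looks something like this:
```
I feel very good about these beers. positive positive
I am tired of this stuff. negative negative
```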
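Reading individual lines works for a handful of samples, but a single number is easier to compare across runs. A small sketch of an accuracy calculation is shown below; the function name `accuracy` and the example label pairs are made up for illustration and are not results from the model above.

```python
def accuracy(pairs):
    """pairs: iterable of (true_label, predicted_label). Returns the fraction correct."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no predictions to score")
    correct = sum(1 for true, pred in pairs if true == pred)
    return correct / len(pairs)

# Made-up label pairs, only to show the calculation.
results = [
    ('positive', 'positive'),
    ('negative', 'positive'),
    ('negative', 'negative'),
    ('positive', 'positive'),
]
print(accuracy(results))  # 0.75
```

With a test set of only two samples the accuracy can only be 0, 0.5, or 1, so a larger data set is needed before the number means much.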