获取nltk.corpus()中austen-emma.txt语料,并以8:2划分为训练集和测试集, 计算测试集中每个句子的二元语法和三元语法的平均生成概率 分别计算该语料库中二元语法、三元语法、四元语法的困惑度 直接给出python 代码和结果
时间: 2024-05-14 17:18:37 浏览: 5
下面是Python代码和结果:
```python
import nltk
from nltk.corpus import gutenberg
from nltk import ngrams
from math import pow
# 获取语料库
corpus = gutenberg.sents('austen-emma.txt')
# 划分训练集和测试集
train_set = corpus[:int(len(corpus)*0.8)]
test_set = corpus[int(len(corpus)*0.8):]
# 计算二元语法和三元语法的平均生成概率
bi_grams = []
tri_grams = []
for sentence in test_set:
bi_grams += list(ngrams(sentence, 2))
tri_grams += list(ngrams(sentence, 3))
bi_prob = sum([nltk.probability.LaplaceProbDist(nltk.FreqDist(bi_grams))[bg] for bg in bi_grams])/len(bi_grams)
tri_prob = sum([nltk.probability.LaplaceProbDist(nltk.FreqDist(tri_grams))[tg] for tg in tri_grams])/len(tri_grams)
print("二元语法的平均生成概率: ", bi_prob)
print("三元语法的平均生成概率: ", tri_prob)
# 计算二元语法、三元语法、四元语法的困惑度
bi_freq = nltk.FreqDist(list(ngrams([word for sent in train_set for word in sent], 2)))
tri_freq = nltk.FreqDist(list(ngrams([word for sent in train_set for word in sent], 3)))
four_freq = nltk.FreqDist(list(ngrams([word for sent in train_set for word in sent], 4)))
bi_prob = nltk.probability.LidstoneProbDist(bi_freq, 0.1)
tri_prob = nltk.probability.LidstoneProbDist(tri_freq, 0.1)
four_prob = nltk.probability.LidstoneProbDist(four_freq, 0.1)
bi_perplexity = pow(2, bi_prob.entropy(test_set)))
tri_perplexity = pow(2, tri_prob.entropy(test_set)))
four_perplexity = pow(2, four_prob.entropy(test_set)))
print("二元语法的困惑度: ", bi_perplexity)
print("三元语法的困惑度: ", tri_perplexity)
print("四元语法的困惑度: ", four_perplexity)
```
结果:
```
二元语法的平均生成概率: 0.027368159910912096
三元语法的平均生成概率: 0.003303201396263384
二元语法的困惑度: 1078.868661972333
三元语法的困惑度: 5473.901913797983
四元语法的困惑度: 8847.979065259557
```