n-gram Language Model Code in Python
Below is a simple n-gram language model implemented in Python:
```python
import nltk
from collections import defaultdict

# Make sure the Punkt tokenizer model is available
nltk.download("punkt", quiet=True)

# Prepare the training data
text = "this is a test sentence. this sentence is for testing purposes."
tokens = nltk.word_tokenize(text.lower())

# Generate n-grams from a token list
def generate_ngrams(tokens, n):
    ngrams = []
    for i in range(len(tokens) - n + 1):
        ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

# Generate all unigrams, bigrams, and trigrams
unigrams = generate_ngrams(tokens, 1)
bigrams = generate_ngrams(tokens, 2)
trigrams = generate_ngrams(tokens, 3)

# Count the frequency of each n-gram
unigram_freq = defaultdict(int)
for unigram in unigrams:
    unigram_freq[unigram] += 1

bigram_freq = defaultdict(int)
for bigram in bigrams:
    bigram_freq[bigram] += 1

trigram_freq = defaultdict(int)
for trigram in trigrams:
    trigram_freq[trigram] += 1

# Estimate each n-gram's probability by maximum likelihood
unigram_prob = {}
for unigram in unigram_freq:
    unigram_prob[unigram] = unigram_freq[unigram] / len(unigrams)

# P(w2 | w1) = count(w1 w2) / count(w1)
bigram_prob = {}
for bigram in bigram_freq:
    bigram_prob[bigram] = bigram_freq[bigram] / unigram_freq[bigram.split()[0]]

# P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)
trigram_prob = {}
for trigram in trigram_freq:
    trigram_prob[trigram] = trigram_freq[trigram] / bigram_freq[" ".join(trigram.split()[:2])]

# Use the bigram model to score a sentence
test_sentence = "this is a test sentence."
test_tokens = nltk.word_tokenize(test_sentence.lower())
test_bigrams = generate_ngrams(test_tokens, 2)
prob = 1.0
for bigram in test_bigrams:
    if bigram in bigram_prob:
        prob *= bigram_prob[bigram]
    else:
        # Back off to the unigram probability of the second word
        # (0.0 if the word was never seen in training)
        prob *= unigram_prob.get(bigram.split()[1], 0.0)
print(f"The probability of the sentence '{test_sentence}' is {prob}.")
```
This code uses the nltk library to tokenize the text, then uses defaultdict from Python's collections module to count n-gram frequencies and estimate their probabilities. Finally, the model multiplies those probabilities together to score a given sentence.
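One weakness of the maximum-likelihood estimates above is that any bigram absent from the training text gets probability zero, which zeroes out the whole sentence. A standard remedy is add-one (Laplace) smoothing. The sketch below is a minimal, self-contained illustration of that idea, not part of the original code; the function name `smoothed_bigram_prob` and the whitespace-split corpus are simplifications chosen here for demonstration.

```python
from collections import defaultdict

# Same toy corpus, pre-tokenized by whitespace for simplicity
tokens = "this is a test sentence . this sentence is for testing purposes .".split()

# Count unigrams and bigrams
unigram_freq = defaultdict(int)
bigram_freq = defaultdict(int)
for i, w in enumerate(tokens):
    unigram_freq[w] += 1
    if i + 1 < len(tokens):
        bigram_freq[(w, tokens[i + 1])] += 1

vocab_size = len(unigram_freq)  # number of distinct word types

def smoothed_bigram_prob(w1, w2):
    """Add-one smoothed P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V)."""
    return (bigram_freq[(w1, w2)] + 1) / (unigram_freq[w1] + vocab_size)

# A bigram seen in training gets a relatively high probability...
print(smoothed_bigram_prob("this", "is"))
# ...while an unseen bigram still gets a small but nonzero probability.
print(smoothed_bigram_prob("test", "purposes"))
```

Because every bigram now has nonzero probability, the sentence-scoring loop no longer needs a special case for unseen bigrams, at the cost of shifting some probability mass away from observed events.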