n-gram model training code
Date: 2023-11-09 16:20:02
Below is a simple example of n-gram model training code, using a bigram model:
```python
from collections import defaultdict

# Read the text file, splitting each line into a list of words (one sentence per line)
def read_corpus(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            words = line.strip().split()
            data.append(words)
    return data

# Count unigram and bigram frequencies
def count_freq(data):
    unigram_freq = defaultdict(int)
    bigram_freq = defaultdict(int)
    for sentence in data:
        for i in range(len(sentence)):
            # Count the word itself
            unigram_freq[sentence[i]] += 1
            # Count the bigram (previous word, current word)
            if i > 0:
                bigram_freq[(sentence[i-1], sentence[i])] += 1
    return unigram_freq, bigram_freq

# Compute bigram probabilities: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
def calc_bigram_prob(unigram_freq, bigram_freq):
    bigram_prob = {}
    for bigram, freq in bigram_freq.items():
        prev_word = bigram[0]
        bigram_prob[bigram] = freq / unigram_freq[prev_word]
    return bigram_prob

# Train the model
def train(file_path):
    data = read_corpus(file_path)
    unigram_freq, bigram_freq = count_freq(data)
    bigram_prob = calc_bigram_prob(unigram_freq, bigram_freq)
    return bigram_prob

# Score a sentence by multiplying the probabilities of its bigrams
def test(model, sentence):
    words = sentence.strip().split()
    prob = 1.0
    for i in range(1, len(words)):
        prev_word = words[i-1]
        curr_word = words[i]
        # An unseen bigram gets probability 0, which zeroes out the whole product
        prob *= model.get((prev_word, curr_word), 0.0)
    return prob

# Example
if __name__ == '__main__':
    file_path = 'corpus.txt'
    model = train(file_path)
    sentence = '我 爱 北京 天安门'
    prob = test(model, sentence)
    print('Probability:', prob)
```
In the code above, we first read the text file and count the frequency of every word and every bigram. We then convert the bigram counts into conditional probabilities by dividing each bigram count by the count of its first word. Finally, we score a test sentence by multiplying the probabilities of its consecutive bigrams.
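One limitation of this maximum-likelihood estimate is that any bigram absent from the training corpus gets probability zero, which makes the probability of an entire sentence collapse to zero. A common remedy is add-one (Laplace) smoothing. The sketch below shows one way to apply it on top of the same frequency counts; the function name `bigram_prob_laplace` and the tiny in-memory corpus are illustrative assumptions, not part of the original code.

```python
from collections import defaultdict

def count_freq(data):
    # Same counting logic as in the training code above
    unigram_freq = defaultdict(int)
    bigram_freq = defaultdict(int)
    for sentence in data:
        for i in range(len(sentence)):
            unigram_freq[sentence[i]] += 1
            if i > 0:
                bigram_freq[(sentence[i-1], sentence[i])] += 1
    return unigram_freq, bigram_freq

def bigram_prob_laplace(prev_word, curr_word, unigram_freq, bigram_freq, vocab_size):
    # Add-one smoothing: pretend every possible bigram was seen once more,
    # so the numerator gains +1 and the denominator gains the vocabulary size V.
    return (bigram_freq[(prev_word, curr_word)] + 1) / (unigram_freq[prev_word] + vocab_size)

if __name__ == '__main__':
    corpus = [['a', 'b', 'a'], ['a', 'c']]
    uni, bi = count_freq(corpus)
    V = len(uni)  # vocabulary size: {'a', 'b', 'c'} -> 3
    # Seen bigram ('a', 'b'): (1 + 1) / (3 + 3) = 1/3
    print(bigram_prob_laplace('a', 'b', uni, bi, V))
    # Unseen bigram ('b', 'c'): (0 + 1) / (1 + 3) = 1/4
    print(bigram_prob_laplace('b', 'c', uni, bi, V))
```

With smoothing, unseen bigrams receive a small nonzero probability instead of zeroing out the sentence score, at the cost of slightly discounting the probabilities of seen bigrams.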