Python: given an input sentence, how can I determine which of five files it belongs to, using the jieba library and word-frequency statistics?
Posted: 2023-06-19 12:03:51
Here is one possible implementation approach:
1. Prepare five files: file1.txt, file2.txt, file3.txt, file4.txt, file5.txt.
2. Tokenize each file with the jieba library and count how many times each word occurs.
3. Save each file's word-frequency table to a corresponding file: file1_freq.txt, file2_freq.txt, file3_freq.txt, file4_freq.txt, file5_freq.txt.
4. At query time, tokenize the input sentence the same way and count its word frequencies.
5. Read each of the five frequency files and compute the cosine similarity between the sentence's frequency vector and each file's frequency vector.
6. The file with the highest cosine similarity is taken as the one the sentence belongs to.
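The cosine similarity in step 5 can be isolated into a small helper over two word → count dictionaries. This is a minimal sketch (the example dictionaries are made up for illustration); note that each vector's norm must run over *all* of its words, not only the words the two vectors share:

```python
import math

def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two word -> count dictionaries."""
    # Dot product only needs the words present in freq_a;
    # missing words contribute 0 via .get()
    dot = sum(v * freq_b.get(k, 0) for k, v in freq_a.items())
    # Each norm is taken over the full vector
    norm_a = math.sqrt(sum(v ** 2 for v in freq_a.values()))
    norm_b = math.sqrt(sum(v ** 2 for v in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Identical distributions score (approximately) 1.0; disjoint ones score 0.0
print(cosine_similarity({'猫': 2, '狗': 1}, {'猫': 2, '狗': 1}))
print(cosine_similarity({'猫': 1}, {'狗': 1}))
```

Factoring this out also makes the similarity logic easy to unit-test independently of jieba and the files on disk.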
Below is a possible code implementation:
```python
import jieba
import os
import math

# File names and paths
file_names = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt', 'file5.txt']
freq_names = ['file1_freq.txt', 'file2_freq.txt', 'file3_freq.txt',
              'file4_freq.txt', 'file5_freq.txt']
file_paths = [os.path.join(os.getcwd(), name) for name in file_names]
freq_paths = [os.path.join(os.getcwd(), name) for name in freq_names]

# Tokenize a file with jieba and write its word-frequency table
def process_file(file_path, freq_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
    word_freq = {}
    for word in jieba.cut(content):
        word = word.strip()
        if not word:  # skip whitespace/newline tokens produced by jieba
            continue
        word_freq[word] = word_freq.get(word, 0) + 1
    with open(freq_path, 'w', encoding='utf-8') as freq_file:
        for k, v in word_freq.items():
            freq_file.write('{} {}\n'.format(k, v))

# Process each of the five files
for file_path, freq_path in zip(file_paths, freq_paths):
    process_file(file_path, freq_path)

# Read the input sentence
sentence = input('Enter a sentence: ')

# Tokenize the sentence and count its word frequencies
word_freq = {}
for word in jieba.cut(sentence):
    word = word.strip()
    if not word:
        continue
    word_freq[word] = word_freq.get(word, 0) + 1

# Compute the cosine similarity against each file's frequency table
max_similarity = -1
max_index = -1
for i, freq_path in enumerate(freq_paths):
    freq_dict = {}
    with open(freq_path, 'r', encoding='utf-8') as freq_file:
        for line in freq_file:
            k, v = line.rsplit(' ', 1)
            freq_dict[k] = int(v)
    numerator = sum(v * freq_dict.get(k, 0) for k, v in word_freq.items())
    # Each norm must run over its full vector, not just the words
    # shared with the input sentence
    denominator = (math.sqrt(sum(v ** 2 for v in word_freq.values()))
                   * math.sqrt(sum(v ** 2 for v in freq_dict.values())))
    similarity = numerator / denominator if denominator != 0 else 0
    if similarity > max_similarity:
        max_similarity = similarity
        max_index = i

# Report the best match
if max_index != -1:
    print('The sentence most likely belongs to {}.'.format(file_names[max_index]))
else:
    print('Unable to determine which file the sentence belongs to.')
```
Note that cosine similarity as computed here applies to a non-negative vector-space model; in practice, text usually needs preprocessing such as normalization and stop-word removal first. Also, this implementation relies on a single feature (raw word frequency), whereas real applications often combine several, e.g. TF-IDF weights or word embeddings.
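To illustrate the TF-IDF weighting mentioned above, here is a minimal pure-Python sketch (the toy documents are made up for illustration; a real project would more likely use a library implementation such as scikit-learn's TfidfVectorizer, which also applies smoothing):

```python
import math

def tfidf_vectors(docs):
    """docs: list of token lists -> list of word -> TF-IDF weight dicts."""
    n = len(docs)
    # Document frequency: in how many documents each word appears
    df = {}
    for tokens in docs:
        for word in set(tokens):
            df[word] = df.get(word, 0) + 1
    vectors = []
    for tokens in docs:
        tf = {}
        for word in tokens:
            tf[word] = tf.get(word, 0) + 1
        # Weight = term frequency * inverse document frequency
        vectors.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return vectors

# Toy example: a word shared by all documents gets idf = log(1) = 0,
# so it contributes nothing to distinguishing the documents
docs = [['apple', 'banana'], ['apple', 'cherry'], ['apple', 'date']]
vecs = tfidf_vectors(docs)
print(vecs[0]['apple'])   # 0.0 -- appears everywhere, carries no signal
```

Replacing the raw counts in the answer's code with these weights would down-weight common words (a cheap alternative to an explicit stop-word list) while keeping the same cosine-similarity comparison.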