Improved TF-IDF algorithm: Python code that changes the IDF value to use per-line popularity weights read from a custom file
Date: 2023-12-22 21:02:44
The following Python code adjusts the IDF value using per-line popularity weights read from a custom file:
```
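import math


def tfidf(docs, weight_path="weight.txt"):
    # Tokenize each document once (whitespace-separated words)
    tokenized = [doc.split() for doc in docs]

    # Build the word-frequency dictionary over the whole corpus
    word_freq = {}
    for words in tokenized:
        for word in words:
            word_freq[word] = word_freq.get(word, 0) + 1

    # Compute TF values: count of the word in the document / document length
    tf_values = []
    for doc_words in tokenized:
        doc_freq = {}
        for word in doc_words:
            doc_freq[word] = doc_freq.get(word, 0) + 1
        tf_dict = {}
        for word in doc_freq:
            tf_dict[word] = doc_freq[word] / len(doc_words)
        tf_values.append(tf_dict)

    # Read the custom popularity-weight file: one "word weight" pair per line
    # (UTF-8 encoding is an assumption; adjust it to match your file)
    weight_dict = {}
    with open(weight_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            word, weight = parts
            weight_dict[word] = float(weight)

    # Compute IDF values and multiply them by the popularity weight
    idf_values = {}
    for word in word_freq:
        # Count documents that contain the word as a whole token, not a substring
        doc_containing_word = sum(1 for words in tokenized if word in words)
        idf = math.log(len(docs) / (1 + doc_containing_word))
        # Assumption: words missing from the weight file get a neutral weight of 1.0
        idf_values[word] = idf * weight_dict.get(word, 1.0)

    # Compute TF-IDF values
    tfidf_values = []
    for tf_dict in tf_values:
        tfidf_dict = {}
        for word in tf_dict:
            tfidf_dict[word] = tf_dict[word] * idf_values[word]
        tfidf_values.append(tfidf_dict)
    return tfidf_values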
```
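Here, `docs` is a list of documents, each of which is a string of whitespace-separated words. Each line of the custom popularity-weight file `weight.txt` contains a word and a floating-point number, separated by whitespace, giving that word's popularity weight. The IDF of each word is multiplied by this weight before it is combined with the TF value. `tfidf_values` is a list with one element per document; each element is a dictionary whose keys are words and whose values are the weighted TF-IDF scores.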
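Note that the function above multiplies the IDF value by the weight rather than replacing it. If the goal is to drop IDF entirely and use only the popularity weight from the file, a minimal sketch might look like the following. The names `load_weights` and `tf_weight`, and the `default` fallback of 1.0 for words missing from the file, are hypothetical choices made for illustration and are not part of the original answer.

```python
import os
import tempfile


def load_weights(path):
    """Read 'word weight' pairs, one per line, into a dict (hypothetical helper)."""
    weights = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:  # skip blank or malformed lines
                weights[parts[0]] = float(parts[1])
    return weights


def tf_weight(docs, weights, default=1.0):
    """Score each word as TF * popularity weight, with no IDF term.

    Assumption: words absent from `weights` use `default`.
    """
    results = []
    for doc in docs:
        words = doc.split()
        counts = {}
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        results.append({
            word: count / len(words) * weights.get(word, default)
            for word, count in counts.items()
        })
    return results


# Usage: write a small weight file, then score two toy documents.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("apple 2.0\nbanana 0.5\n")
    weight_file = tmp.name

scores = tf_weight(["apple banana apple", "banana cherry"], load_weights(weight_file))
os.remove(weight_file)
print(scores)
```

In this variant, a word's score depends only on how often it appears in the document and on its weight in the file, so a word that is common across the corpus is no longer penalized unless the weight file says so.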