TF-IDF algorithm: Python source code
Date: 2023-11-10 15:02:59 Views: 143
The Python source code for the TF-IDF algorithm is as follows:
```python
import math
from typing import List


def tf(word: str, document: List[str]) -> float:
    # Term frequency: fraction of tokens in `document` equal to `word`.
    if not document:
        return 0.0  # empty document; avoid division by zero
    return document.count(word) / len(document)


def idf(word: str, documents: List[List[str]]) -> float:
    # Inverse document frequency: log(N / number of documents containing `word`).
    num_documents_containing_word = sum(1 for document in documents if word in document)
    if num_documents_containing_word == 0:
        return 0.0  # word absent from the corpus; avoid division by zero
    return math.log(len(documents) / num_documents_containing_word)


def tf_idf(word: str, document: List[str], documents: List[List[str]]) -> float:
    # TF-IDF score: product of term frequency and inverse document frequency.
    return tf(word, document) * idf(word, documents)


def get_top_n_words(n: int, document: List[str], documents: List[List[str]]) -> List[str]:
    # Score every distinct word in the document, then keep the n highest-scoring.
    words = set(document)
    word_scores = [(word, tf_idf(word, document, documents)) for word in words]
    sorted_word_scores = sorted(word_scores, key=lambda t: t[1], reverse=True)
    return [word for word, _ in sorted_word_scores[:n]]
```
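Here, `tf` computes how frequently a word occurs within a single document, and `idf` computes the word's inverse document frequency across the whole document collection. `tf_idf` is the product of the two and indicates how important the word is within that document. `get_top_n_words` returns the n words with the highest TF-IDF values in a given document.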
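As an illustration of these definitions on concrete data, here is a minimal self-contained sketch that scores every word of one document in a toy corpus. The helper names `document_frequencies` and `tf_idf_scores` and the three-document `corpus` are assumptions made for this example and do not come from the original code. The sketch also counts document frequencies once up front instead of rescanning the corpus for each word. It assumes documents are already tokenized into lists of strings and that the scored document belongs to the corpus.

```python
import math
from collections import Counter
from typing import Dict, List


def document_frequencies(documents: List[List[str]]) -> Counter:
    # Hypothetical helper: for each word, count how many documents contain it.
    df = Counter()
    for document in documents:
        df.update(set(document))  # set() so a word counts once per document
    return df


def tf_idf_scores(document: List[str], documents: List[List[str]]) -> Dict[str, float]:
    # Hypothetical helper. Assumes `document` is one of `documents`,
    # so every word in it has a document frequency of at least 1.
    df = document_frequencies(documents)
    counts = Counter(document)
    total = len(document)
    n_docs = len(documents)
    return {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in counts.items()
    }


# Toy corpus, invented for illustration only.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "chased", "the", "dog"],
]

scores = tf_idf_scores(corpus[0], corpus)
for word, score in sorted(scores.items(), key=lambda t: t[1], reverse=True):
    print(f"{word}: {score:.4f}")
```

In this toy corpus, "the" appears in all three documents, so its IDF is log(3/3) = 0 and its score is zero even though it is the most frequent word in the document. "mat" appears only in the first document and therefore ranks highest.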