Python code for TF-IDF cosine similarity
Date: 2023-09-03 11:15:06
The following Python code implements TF-IDF and cosine similarity:
```python
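import math
from collections import Counter

# Count how often each word occurs in one tokenized document
def word_frequency(words):
    return Counter(words)

# Count the tokenized documents that contain the word
def document_frequency(word, documents):
    return sum(1 for document in documents if word in document)

# TF-IDF weight of a word in one document, relative to the whole corpus.
# Uses the smoothed IDF log((1 + N) / (1 + df)) + 1 so that weights stay
# positive even for a word that occurs in every document of a small corpus.
def tf_idf(word, words, documents):
    idf = math.log((1 + len(documents)) / (1 + document_frequency(word, documents))) + 1
    return word_frequency(words)[word] * idf

# Lowercase a document and strip surrounding punctuation from each token
def tokenize(doc):
    tokens = (token.strip(".,") for token in doc.lower().split())
    return [token for token in tokens if token]

# Cosine similarity of two documents; the IDF is computed over the whole corpus
def cosine_similarity(doc1, doc2, corpus):
    words1 = tokenize(doc1)
    words2 = tokenize(doc2)
    documents = [set(tokenize(doc)) for doc in corpus]
    # Build both vectors over the shared vocabulary
    words = sorted(set(words1 + words2))
    vec1 = [tf_idf(word, words1, documents) for word in words]
    vec2 = [tf_idf(word, words2, documents) for word in words]
    # Cosine similarity = dot product / product of the vector norms
    dot_product = sum(x * y for x, y in zip(vec1, vec2))
    magnitude1 = math.sqrt(sum(x ** 2 for x in vec1))
    magnitude2 = math.sqrt(sum(x ** 2 for x in vec2))
    if magnitude1 == 0 or magnitude2 == 0:
        return 0.0  # an empty document has no direction to compare
    return dot_product / (magnitude1 * magnitude2)

# Example
doc1 = "Python is a programming language that lets you work more quickly and integrate your systems more effectively."
doc2 = "Java is a popular programming language that is used for developing mobile apps, desktop apps, and games."
doc3 = "Ruby is a dynamic, open-source programming language with a focus on simplicity and productivity."
corpus = [doc1, doc2, doc3]
print(cosine_similarity(doc1, doc2, corpus))  # doc1 vs doc2
print(cosine_similarity(doc1, doc3, corpus))  # doc1 vs doc3
print(cosine_similarity(doc2, doc3, corpus))  # doc2 vs doc3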
```
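For reference, the two quantities the code works with can be written compactly. The weight of a term in a document is its term frequency times its inverse document frequency, and the similarity is the cosine of the angle between two weight vectors. The exact form of idf varies between implementations (different smoothing terms), so it is left generic here; the symbols $w_{t,d}$, $\mathbf{a}$ and $\mathbf{b}$ are notation chosen for this note.

```latex
w_{t,d} = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\cos(\mathbf{a}, \mathbf{b}) =
  \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}
```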
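The code above computes TF-IDF weights and cosine similarity and includes an example. `word_frequency` counts term frequencies, `document_frequency` counts the documents that contain a word, `tf_idf` computes the TF-IDF weight of a word, and `cosine_similarity` computes the cosine similarity of two documents.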
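Because `tf_idf` recounts the words of a document on every call, comparing many documents this way repeats a lot of work. A minimal sketch of one alternative is to compute every document's weights once and then compare all pairs. The names `tfidf_vectors`, `sparse_cosine`, `similarity_matrix` and `sample_docs` are hypothetical, and the smoothed IDF `log((1 + N) / (1 + df)) + 1` is an assumed weighting, not the only possible one.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(token_lists)
    # Document frequency: number of documents that contain each word
    df = Counter(word for tokens in token_lists for word in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Assumed weighting: raw count times smoothed IDF
        vectors.append({
            word: count * (math.log((1 + n_docs) / (1 + df[word])) + 1)
            for word, count in tf.items()
        })
    return vectors

def sparse_cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document cannot be compared
    return dot / (norm_a * norm_b)

def similarity_matrix(token_lists):
    """Pairwise cosine similarities for all documents in the corpus."""
    vectors = tfidf_vectors(token_lists)
    return [[sparse_cosine(a, b) for b in vectors] for a in vectors]

# Hypothetical toy corpus, already split into tokens
sample_docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "stock prices fell sharply today".split(),
]
matrix = similarity_matrix(sample_docs)
for row in matrix:
    print([round(value, 3) for value in row])
```

Storing each vector as a dict keeps only the words that actually occur in a document, so the dot product touches just those words instead of the full vocabulary. Documents with no words in common, such as the first and third here, get a similarity of exactly zero.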