Python Text Similarity
Date: 2024-01-14 12:21:51
Here are two commonly used ways to compute text similarity in Python:
1. Cosine similarity:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
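# Define the two texts
text1 = "This is the first document."
text2 = "This document is the second document."
# Fit one vocabulary and one set of IDF weights on both texts, so that
# words unique to text2 (such as "second") are not silently dropped
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([text1, text2])
# Compute the cosine similarity between the two row vectors
similarity = cosine_similarity(vectors[0:1], vectors[1:2])
print("Cosine similarity:", similarity[0][0])  # about 0.68 with the default settings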
```
2. Jaccard similarity:
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import pairwise_distances
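# Define the two texts
text1 = "This is the first document."
text2 = "This document is the second document."
# Build binary word-presence vectors over a vocabulary fitted on both texts;
# the "jaccard" metric needs dense boolean input, not a sparse matrix
vectorizer = CountVectorizer(binary=True)
vectors = vectorizer.fit_transform([text1, text2]).toarray().astype(bool)
# Jaccard similarity = 1 - Jaccard distance
similarity = 1 - pairwise_distances(vectors[0:1], vectors[1:2], metric="jaccard")
print("Jaccard similarity:", similarity[0][0])  # about 0.667 (4 shared words out of 6 distinct)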
```
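As a rough cross-check of both measures, the sketch below recomputes them with only the standard library. The helper names `tokenize`, `cosine_from_counts`, and `jaccard_from_sets` are made up for this illustration and are not part of scikit-learn. The tokenizer only approximates scikit-learn's default (lowercased words of two or more characters), and the cosine here uses raw term counts rather than TF-IDF weights, so its value will differ from the TF-IDF result.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep word tokens of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def cosine_from_counts(a, b):
    """Cosine similarity of raw term-count vectors (no IDF weighting)."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    # A Counter returns 0 for missing keys, so only shared terms contribute.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an empty text as having no similarity
    return dot / (norm_a * norm_b)


def jaccard_from_sets(a, b):
    """Jaccard similarity: shared distinct words / all distinct words."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    union = set_a | set_b
    if not union:
        return 0.0  # both texts are empty
    return len(set_a & set_b) / len(union)


text1 = "This is the first document."
text2 = "This document is the second document."
print(f"Count-based cosine: {cosine_from_counts(text1, text2):.4f}")  # 0.7906
print(f"Set-based Jaccard:  {jaccard_from_sets(text1, text2):.4f}")   # 0.6667
```

The Jaccard value matches the scikit-learn version because both reduce each text to its set of distinct words: 4 shared words out of 6 distinct ones. The cosine value is higher than the TF-IDF figure because raw counts do not down-weight words by how many documents contain them.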