Computing cosine similarity in Python and selecting the top 10
Date: 2023-07-25 15:08:31
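You can compute cosine similarity in Python with the scikit-learn library. The basic steps are:
1. Vectorize the texts, for example with TF-IDF or a bag-of-words model.
2. Compute the pairwise cosine similarity matrix of the text vectors.
3. For each text, select the 10 other texts with the highest cosine similarity to it.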
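To make step 2 concrete, here is a minimal sketch of the cosine similarity between two vectors, written with plain NumPy. The helper name `cosine_sim` and the choice to return 0 for a zero vector are assumptions made for this illustration; they are not part of scikit-learn.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two 1-D vectors: (u . v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    # Convention assumed for this sketch: a zero vector has similarity 0 with everything
    return float(u @ v / denom) if denom else 0.0

print(cosine_sim([1, 2, 0], [2, 4, 0]))  # parallel vectors, approximately 1
print(cosine_sim([1, 0, 0], [0, 1, 0]))  # orthogonal vectors, 0
```

Each entry `[i, j]` of the similarity matrix in step 2 is this quantity for the vectors of text `i` and text `j`.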
Example code:
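```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

docs = ["This is the first document.", "This is the second document.", "And this is the third one.",
        "Is this the first document?", "The last document is here."]

# Vectorize the documents with TF-IDF, then compute the pairwise cosine similarity matrix
tfidf = TfidfVectorizer().fit_transform(docs)
cosine_similarities = cosine_similarity(tfidf)

top_k = 10
for i, doc in enumerate(docs):
    # Similarity between this document and every document (including itself)
    similarities = cosine_similarities[i]
    # Sort indices by descending similarity; the stable sort breaks ties by lower index
    order = np.argsort(-similarities, kind="stable")
    # Drop the document itself explicitly (it is not guaranteed to sort first when
    # another document has the same similarity), then keep the top_k most similar
    most_similar = order[order != i][:top_k]
    print(f"Top {top_k} similar documents for document {i}:")
    for j in most_similar:
        print(f"Document {j}: {docs[j]} (Similarity: {similarities[j]:.3f})")
```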
The output is shown below, with similarities rounded to three decimal places. There are only five documents, so each list has four entries rather than ten. Documents 0 and 3 contain exactly the same words, so their TF-IDF vectors are identical and their similarity is 1. Every pair shares at least "is" and "the" (the default `TfidfVectorizer` does not remove stop words), so no similarity is 0.
```
Top 10 similar documents for document 0:
Document 3: Is this the first document? (Similarity: 1.000)
Document 1: This is the second document. (Similarity: 0.571)
Document 4: The last document is here. (Similarity: 0.351)
Document 2: And this is the third one. (Similarity: 0.301)
Top 10 similar documents for document 1:
Document 0: This is the first document. (Similarity: 0.571)
Document 3: Is this the first document? (Similarity: 0.571)
Document 4: The last document is here. (Similarity: 0.321)
Document 2: And this is the third one. (Similarity: 0.275)
Top 10 similar documents for document 2:
Document 0: This is the first document. (Similarity: 0.301)
Document 3: Is this the first document? (Similarity: 0.301)
Document 1: This is the second document. (Similarity: 0.275)
Document 4: The last document is here. (Similarity: 0.140)
Top 10 similar documents for document 3:
Document 0: This is the first document. (Similarity: 1.000)
Document 1: This is the second document. (Similarity: 0.571)
Document 4: The last document is here. (Similarity: 0.351)
Document 2: And this is the third one. (Similarity: 0.301)
Top 10 similar documents for document 4:
Document 0: This is the first document. (Similarity: 0.351)
Document 3: Is this the first document? (Similarity: 0.351)
Document 1: This is the second document. (Similarity: 0.321)
Document 2: And this is the third one. (Similarity: 0.140)
```
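For a large collection, fully sorting every row of the similarity matrix does more work than selecting 10 items needs. The following is a minimal sketch of a partial-selection approach with `np.argpartition`, assuming a dense square similarity matrix such as the one `cosine_similarity` returns. The helper name `top_k_similar` is hypothetical and not part of NumPy or scikit-learn.

```python
import numpy as np

def top_k_similar(sim, k=10):
    """For each row of a square similarity matrix, return the indices of the
    k most similar other rows, most similar first."""
    sim = np.array(sim, dtype=float)      # copy so the caller's matrix is untouched
    np.fill_diagonal(sim, -np.inf)        # exclude each document from its own list
    k = min(k, sim.shape[0] - 1)          # cannot return more neighbours than exist
    if k <= 0:
        return np.empty((sim.shape[0], 0), dtype=int)
    # argpartition moves the k largest entries of each row into the last k columns (unordered)
    part = np.argpartition(sim, -k, axis=1)[:, -k:]
    # Sort only those k candidates by descending similarity
    rows = np.arange(sim.shape[0])[:, None]
    order = np.argsort(-sim[rows, part], axis=1, kind="stable")
    return part[rows, order]

# Small hand-made matrix for illustration
sim = np.array([[1.0, 0.2, 0.9],
                [0.2, 1.0, 0.5],
                [0.9, 0.5, 1.0]])
print(top_k_similar(sim, k=2))  # rows: [2 1], [2 0], [0 1]
```

Setting the diagonal to negative infinity removes each document from its own candidate list regardless of ties, and only the `k` selected entries per row are sorted. The order among exactly tied similarities is not guaranteed by `np.argpartition`.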