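Compute the pairwise similarity between the three documents data1.txt, data2.txt, and data3.txt. The three documents are in the archive for the third lab. Implement this in Python.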
Posted: 2024-06-15 18:08:54
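To compute the similarity between the three documents, you can use a text similarity measure such as cosine similarity. The following Python example builds a term-frequency vector for each document and computes the cosine similarity of each pair: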
```python
import math
import re


def calculate_similarity(doc1, doc2):
    # Read the contents of both documents
    with open(doc1, 'r', encoding='utf-8') as f:
        content1 = f.read()
    with open(doc2, 'r', encoding='utf-8') as f:
        content2 = f.read()

    # Preprocess: strip punctuation and special characters
    pattern = re.compile(r'[^\w\s]')
    content1 = pattern.sub('', content1)
    content2 = pattern.sub('', content2)

    # Tokenize on whitespace. Text written without spaces between words
    # (Chinese, for example) needs a word segmenter here instead.
    words1 = content1.split()
    words2 = content2.split()

    # Build term-frequency vectors
    vector1 = {}
    vector2 = {}
    for word in words1:
        vector1[word] = vector1.get(word, 0) + 1
    for word in words2:
        vector2[word] = vector2.get(word, 0) + 1

    # Compute cosine similarity: dot product divided by the product of norms
    dot_product = 0.0
    norm1 = 0.0
    norm2 = 0.0
    for word in vector1:
        dot_product += vector1[word] * vector2.get(word, 0)
        norm1 += vector1[word] ** 2
    for word in vector2:
        norm2 += vector2[word] ** 2

    # Avoid division by zero when a document contains no tokens
    if norm1 == 0 or norm2 == 0:
        return 0.0
    similarity = dot_product / (math.sqrt(norm1) * math.sqrt(norm2))
    return similarity


# Document paths
doc1 = 'data1.txt'
doc2 = 'data2.txt'
doc3 = 'data3.txt'

# Compute the similarity of each pair
similarity_1_2 = calculate_similarity(doc1, doc2)
similarity_1_3 = calculate_similarity(doc1, doc3)
similarity_2_3 = calculate_similarity(doc2, doc3)
print(f"Similarity between document 1 and document 2: {similarity_1_2}")
print(f"Similarity between document 1 and document 3: {similarity_1_3}")
print(f"Similarity between document 2 and document 3: {similarity_2_3}")
```
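Make sure to replace `data1.txt`, `data2.txt`, and `data3.txt` in the code with the actual paths to your documents. Running the code prints the similarity for each of the three document pairs.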
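Whitespace tokenization only works when words are separated by spaces. If the lab documents turn out to be Chinese text (an assumption, since their contents are not shown here), each line would collapse into one or two giant "words" and the scores would be close to meaningless. The sketch below is one possible workaround that stays within the standard library: it swaps the whitespace split for overlapping character bigrams and uses `collections.Counter` for the frequency vectors. The names `char_ngrams`, `cosine_similarity`, and `pairwise_similarity` are hypothetical helpers introduced for illustration, and the inline sample strings stand in for the real file contents.

```python
import math
import re
from collections import Counter
from itertools import combinations


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams, dropping punctuation and whitespace."""
    cleaned = re.sub(r'[\W_]+', '', text)
    if len(cleaned) < n:
        return [cleaned] if cleaned else []
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity of two term-frequency Counters; 0.0 if either is empty."""
    dot = sum(freq * counts_b[term] for term, freq in counts_a.items())
    norm_a = math.sqrt(sum(f * f for f in counts_a.values()))
    norm_b = math.sqrt(sum(f * f for f in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarity(texts):
    """Return {(name_a, name_b): similarity} for every pair in a name -> text mapping."""
    vectors = {name: Counter(char_ngrams(text)) for name, text in texts.items()}
    return {
        (a, b): cosine_similarity(vectors[a], vectors[b])
        for a, b in combinations(vectors, 2)
    }


if __name__ == '__main__':
    # Hypothetical sample strings; replace with the text read from the real files.
    samples = {
        'a': '今天天气很好',
        'b': '今天天气不错',
        'c': 'completely different text',
    }
    for (a, b), score in pairwise_similarity(samples).items():
        print(f'{a} vs {b}: {score:.3f}')
```

To apply this to the lab files, read each one with `open(path, encoding='utf-8')` and pass a mapping such as `{'data1.txt': text1, ...}` to `pairwise_similarity`. Character bigrams are a rough substitute for real word segmentation; a dedicated segmenter would likely give more meaningful scores on Chinese text.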