Python Code for a Plagiarism-Detection Algorithm
Posted: 2023-10-27 20:08:18
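A text plagiarism-detection (duplicate-checking) algorithm measures how similar two texts are. Python offers several ways to implement one, and a commonly used choice is Simhash. Simhash uses a hash function to reduce a text to a fixed-length binary fingerprint, and the Hamming distance between two fingerprints (the number of bit positions in which they differ) indicates how similar the texts are: the smaller the distance, the more similar the texts.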
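To make the Hamming-distance step concrete, here is a minimal sketch that compares two made-up 8-bit fingerprints. The values `fp1` and `fp2` and the helper name `hamming_distance` are illustrative assumptions, not output from a real Simhash run.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two integer fingerprints differ."""
    # XOR leaves a 1 exactly where the two values disagree
    return bin(a ^ b).count("1")

# Hypothetical 8-bit fingerprints, chosen only for illustration
fp1 = 0b10110100
fp2 = 0b10010110

distance = hamming_distance(fp1, fp2)
# Normalize by the fingerprint length to get a score in [0, 1]
score = 1 - distance / 8
print(distance, score)
```

The two values differ in 2 of 8 positions, so the normalized similarity is 0.75.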
The following is an example of Simhash implemented in Python:
```python
import hashlib

def simhash(text, bits=64):
    """Return a `bits`-bit Simhash fingerprint of `text` as an integer."""
    # Tokenize (whitespace split; Chinese text needs a real word segmenter)
    words = text.split()
    # One signed counter per bit position
    counts = [0] * bits
    for word in words:
        # Hash each word and keep the low `bits` bits of its MD5 digest
        h = int(hashlib.md5(word.encode('utf-8')).hexdigest(), 16) & ((1 << bits) - 1)
        # Add the word's weight (1 here) where the bit is set, subtract it otherwise
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Build the fingerprint: bit i is 1 when its counter is positive
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint

def similarity(text1, text2, bits=64):
    # Hamming distance: number of bits in which the two fingerprints differ
    hamming_distance = bin(simhash(text1, bits) ^ simhash(text2, bits)).count('1')
    # Similarity: fraction of bits that agree
    return 1 - hamming_distance / bits

text1 = "This is a sample text for testing."
text2 = "This is another sample text for testing."
similarity_score = similarity(text1, text2)
print("Similarity: %f" % similarity_score)
```
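In practice, Simhash is often used with per-token weights and a distance threshold rather than a raw similarity score. The sketch below is one possible variant under stated assumptions: tokens are weighted by their frequency, `bits` is a multiple of 8 no larger than 128 (the size of an MD5 digest), and `max_distance=3` is a hypothetical default that should be tuned on real data. The names `weighted_simhash` and `is_near_duplicate` are invented for this illustration.

```python
import hashlib
from collections import Counter

def weighted_simhash(text: str, bits: int = 64) -> int:
    """Simhash fingerprint where each token is weighted by its frequency."""
    # Assumption: lowercase whitespace tokens are a good enough feature set
    weights = Counter(text.lower().split())
    counts = [0] * bits
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Use the first bits // 8 bytes of the digest as the token hash
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += weight if (h >> i) & 1 else -weight
    # Bit i of the fingerprint is 1 when its weighted counter is positive
    return sum(1 << i for i, c in enumerate(counts) if c > 0)

def is_near_duplicate(text1: str, text2: str,
                      max_distance: int = 3, bits: int = 64) -> bool:
    """Treat two texts as near-duplicates if their fingerprints are close."""
    diff = weighted_simhash(text1, bits) ^ weighted_simhash(text2, bits)
    return bin(diff).count("1") <= max_distance

print(is_near_duplicate("This is a sample text.", "this is a sample text."))
```

A threshold on the Hamming distance keeps the comparison cheap, since it needs only one XOR and a bit count per pair of fingerprints. Short texts yield noisy fingerprints, so the threshold that works for long documents may be too strict for single sentences.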