Implementing Chinese text similarity matching algorithms in Python
Date: 2023-09-17 11:09:59
The following sample code implements several Chinese text similarity algorithms in Python:
1. Cosine similarity
```python
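import jieba
import numpy as np


def cosine_similarity(s1, s2):
    # Segment both texts. jieba.lcut returns a list, so the tokens can be
    # iterated more than once (jieba.cut returns a one-shot generator).
    words1 = jieba.lcut(s1)
    words2 = jieba.lcut(s2)
    # Build the shared vocabulary and map each word to a vector index
    vocab = sorted(set(words1) | set(words2))
    word_dict = {word: i for i, word in enumerate(vocab)}
    # Build the term-frequency vector of each text
    v1 = np.zeros(len(vocab))
    v2 = np.zeros(len(vocab))
    for word in words1:
        v1[word_dict[word]] += 1
    for word in words2:
        v2[word_dict[word]] += 1
    # Compute the cosine similarity; return 0.0 if either text is empty
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    if norm == 0:
        return 0.0
    return float(np.dot(v1, v2) / norm)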
```
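To check the arithmetic by hand, the same computation can be sketched with only the standard library. This is a minimal sketch, not a replacement for the function above: `cosine_from_tokens` is a hypothetical helper name, and the token lists are segmented by hand for illustration rather than produced by jieba.

```python
import math
from collections import Counter


def cosine_from_tokens(tokens1, tokens2):
    """Cosine similarity between two token lists using raw term counts."""
    c1, c2 = Counter(tokens1), Counter(tokens2)
    # Only words that occur in both texts contribute to the dot product
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(n * n for n in c1.values()))
    norm2 = math.sqrt(sum(n * n for n in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Hand-segmented tokens (illustrative; a real pipeline would call jieba)
a = ["我", "喜欢", "苹果"]
b = ["我", "喜欢", "香蕉"]
score = cosine_from_tokens(a, b)  # dot = 2, norms = sqrt(3) each, so 2/3
```

Two of the three words are shared, so the score is 2/3. Identical texts score 1, and texts with no words in common score 0.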
2. Jaccard similarity
```python
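import jieba


def jaccard_similarity(s1, s2):
    # Segment both texts and keep the unique words
    words1 = set(jieba.cut(s1))
    words2 = set(jieba.cut(s2))
    # Compute the intersection and the union
    intersection = words1 & words2
    union = words1 | words2
    # Compute the Jaccard similarity; return 0.0 when both texts are empty
    if not union:
        return 0.0
    return len(intersection) / len(union)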
```
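When a word segmenter is not available, one common variation is to compare sets of character bigrams instead of words. The sketch below is that swapped-in variant, not the word-level function above; `char_ngrams` and `jaccard_ngrams` are hypothetical names chosen for illustration.

```python
def char_ngrams(text, n=2):
    """Return the set of overlapping character n-grams in text."""
    if len(text) < n:
        # Texts shorter than n are treated as a single gram (or no gram)
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_ngrams(s1, s2, n=2):
    """Jaccard similarity over character n-gram sets."""
    grams1, grams2 = char_ngrams(s1, n), char_ngrams(s2, n)
    union = grams1 | grams2
    if not union:
        return 0.0
    return len(grams1 & grams2) / len(union)


# Bigrams: {我喜, 喜欢, 欢苹, 苹果} vs {我喜, 喜欢, 欢香, 香蕉}
score = jaccard_ngrams("我喜欢苹果", "我喜欢香蕉")  # 2 shared / 6 total = 1/3
```

Character bigrams avoid the dependency on segmentation quality, at the cost of ignoring word boundaries.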
3. Edit distance
```python
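def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    # dp[i][j] is the edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    # Base cases: turning a prefix into the empty string takes one
    # deletion (or insertion) per character
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    # Fill the table with dynamic programming
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                # Cheapest of deletion, insertion, and substitution
                dp[i][j] = min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]) + 1
    # Return the edit distance
    return dp[m][n]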
```
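Edit distance is a distance rather than a similarity: 0 means identical, and larger values mean more different. To compare it with the scores from the other methods, one common convention is to divide by the length of the longer string and subtract from 1. The sketch below assumes that normalization (it is one choice among several) and keeps only two rows of the table to save memory; both function names are hypothetical.

```python
def edit_distance_two_rows(s1, s2):
    """Levenshtein distance keeping only the previous and current DP rows."""
    prev = list(range(len(s2) + 1))
    for i, ch1 in enumerate(s1, start=1):
        curr = [i]
        for j, ch2 in enumerate(s2, start=1):
            cost = 0 if ch1 == ch2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(s1, s2):
    """Map edit distance into [0, 1], where 1.0 means identical strings."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance_two_rows(s1, s2) / longest


distance = edit_distance_two_rows("我喜欢苹果", "我喜欢香蕉")  # 2 substitutions
similarity = edit_similarity("我喜欢苹果", "我喜欢香蕉")       # 1 - 2/5 = 0.6
```

Because the comparison is character by character, no word segmentation is needed for Chinese text here.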
4. Word-vector matching
```python
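import jieba
import gensim
import numpy as np


def word2vec_similarity(s1, s2):
    # Load the pretrained word-vector model. Loading is slow; in practice,
    # load the model once outside the function and reuse it.
    model = gensim.models.KeyedVectors.load_word2vec_format('pretrained_word2vec.bin', binary=True)
    # Segment both texts. jieba.lcut returns lists, which can be iterated
    # more than once (jieba.cut returns a one-shot generator).
    words1 = jieba.lcut(s1)
    words2 = jieba.lcut(s2)
    # Sum the vectors of in-vocabulary words to get one vector per text
    v1 = np.zeros(model.vector_size)
    v2 = np.zeros(model.vector_size)
    for word in words1:
        if word in model:
            v1 += model[word]
    for word in words2:
        if word in model:
            v2 += model[word]
    # Compute the cosine similarity; return 0.0 if either text has no
    # in-vocabulary words
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    if norm == 0:
        return 0.0
    return float(np.dot(v1, v2) / norm)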
```
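The sum-then-cosine step can be tried without downloading a model. The sketch below replaces the pretrained model with a tiny hand-written lookup table: every vector in `TOY_VECTORS` is invented purely for illustration and carries no real semantic information, and the helper names are hypothetical.

```python
import math

# Hypothetical 3-dimensional vectors, invented for illustration only.
# A real pretrained model supplies learned vectors with many more dimensions.
TOY_VECTORS = {
    "苹果": [0.9, 0.1, 0.0],
    "香蕉": [0.8, 0.2, 0.0],
    "电脑": [0.0, 0.1, 0.9],
}


def sentence_vector(tokens, vectors, dim=3):
    """Sum the vectors of known tokens; unknown tokens are skipped."""
    total = [0.0] * dim
    for token in tokens:
        if token in vectors:
            total = [t + v for t, v in zip(total, vectors[token])]
    return total


def cosine(v1, v2):
    """Cosine similarity of two equal-length vectors; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0


apple = sentence_vector(["苹果"], TOY_VECTORS)
fruit = cosine(apple, sentence_vector(["香蕉"], TOY_VECTORS))
mixed = cosine(apple, sentence_vector(["电脑"], TOY_VECTORS))
```

With these made-up vectors, `fruit` comes out far higher than `mixed`, which is the behavior word vectors are meant to provide: texts can score as similar even when they share no words.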
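The examples above are Python implementations of several common Chinese text similarity algorithms. Note that the word-vector approach requires downloading a pretrained word-vector model (such as a Word2Vec model) in advance and using the gensim library to load and query it.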