Implement SVD-based word vectors in Python: use K=5 when building the high-dimensional distributional representation, choose the post-SVD dimensionality yourself, and obtain the word vectors `vec_sta`. Then use these vectors to compute the cosine similarity `sim_svd` between the two words on each line of `wordsim353_agreed.txt`. If a word in `wordsim353_agreed.txt` has no vector (it does not appear in the corpus), set `sim_svd = 0` for that line's word pair.
Posted: 2023-11-22 15:52:43
Sure, here is a code example for your reference:
```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Read the evaluation file (tab-separated: word1, word2, similarity;
# the first line is a header and is skipped below)
with open('wordsim353_agreed.txt', 'r') as f:
    lines = f.readlines()

# Count how often each word appears in the file
word_count = {}
for line in lines[1:]:
    w1, w2, sim = line.strip().split('\t')
    word_count[w1] = word_count.get(w1, 0) + 1
    word_count[w2] = word_count.get(w2, 0) + 1

# Build word->index and index->word mappings
word2idx = {}
idx2word = {}
for idx, w in enumerate(word_count):
    word2idx[w] = idx
    idx2word[idx] = w

# Build the co-occurrence matrix
window_size = 5  # intended context-window size; unused here, since
                 # co-occurrence is taken directly from the file's word pairs
matrix = np.zeros((len(word_count), len(word_count)))
for line in lines[1:]:
    w1, w2, sim = line.strip().split('\t')
    i = word2idx[w1]
    j = word2idx[w2]
    matrix[i][j] += 1
    matrix[j][i] += 1
for i in range(len(word_count)):
    matrix[i][i] += word_count[idx2word[i]]

# Truncated SVD: reduce each word's row to K dimensions
K = 5
svd = TruncatedSVD(n_components=K)
U = svd.fit_transform(matrix)

# Build the word-vector dictionary
vec_sta = {}
for i in range(len(word_count)):
    vec_sta[idx2word[i]] = U[i]

# Cosine similarity for each word pair; 0 if either word has no vector
sim_svd = {}
for line in lines[1:]:
    w1, w2, sim = line.strip().split('\t')
    if w1 not in vec_sta or w2 not in vec_sta:
        sim_svd[(w1, w2)] = 0
    else:
        vec1 = vec_sta[w1]
        vec2 = vec_sta[w2]
        sim_svd[(w1, w2)] = cosine_similarity([vec1], [vec2])[0][0]
```
Here, `wordsim353_agreed.txt` is the similarity-evaluation file; each line has the format `word1<TAB>word2<TAB>similarity`, with a header on the first line. `K` is the dimensionality after SVD reduction, set to 5 here. `window_size` is the context-window size from the task statement, but note that this simplified example builds the co-occurrence matrix directly from the word pairs in the evaluation file rather than sliding a window over a separate corpus, so every word in the file receives a vector. The resulting `vec_sta` is a dictionary mapping each word to its vector, and `sim_svd` is a dictionary mapping each word pair to its cosine similarity.
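To judge how well the SVD vectors track human judgments, the standard metric for WordSim-353 is Spearman's rank correlation between the gold similarity column and the predicted `sim_svd` values. A minimal sketch using `scipy.stats.spearmanr` (the `gold` and `pred` lists here are made-up illustration values, not real scores from the file):

```python
from scipy.stats import spearmanr

# Hypothetical gold scores (human judgments) and model cosine
# similarities for three word pairs -- illustration values only
gold = [7.35, 0.23, 5.10]
pred = [0.91, 0.05, 0.60]

# The rank orders agree perfectly here, so rho = 1.0
rho, p = spearmanr(gold, pred)
print(f"Spearman rho = {rho:.3f}")
```

In practice you would collect `gold` from the third column of each line and `pred` from the corresponding `sim_svd[(w1, w2)]` entry, in the same order.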