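Implement SGNS-based word vectors in Python: use a window of K=2 in the SGNS method, choose the word-vector dimension yourself, and obtain the vectors vec_sgns. Then, using these word vectors, compute the cosine similarity sim_sgns between the two words on each line of wordsim353_agreed.txt. When a word in wordsim353_agreed.txt has no vector (the word does not appear in the corpus), set sim_sgns for the two words on that line to 0.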
Time: 2023-11-22 22:53:39
The code for building word vectors with SGNS is as follows:
```python
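import numpy as np
from collections import Counter

# Read the corpus (assumed to be whitespace-tokenized text)
with open('corpus.txt', 'r', encoding='utf-8') as f:
    corpus = f.read().split()

# Build the vocabulary; sorting keeps the word-to-index mapping reproducible
word_counts = Counter(corpus)
vocab = sorted(word_counts)
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for word, idx in word2idx.items()}
vocab_size = len(vocab)
corpus_ids = np.array([word2idx[word] for word in corpus])

# Collect (center, context) training pairs within a window of K=2
window_size = 2
pairs = []
for pos, center in enumerate(corpus_ids):
    start = max(0, pos - window_size)
    end = min(len(corpus_ids), pos + window_size + 1)
    for ctx_pos in range(start, end):
        if ctx_pos != pos:
            pairs.append((center, corpus_ids[ctx_pos]))
pairs = np.array(pairs)

# Negative-sampling distribution: unigram counts raised to the 0.75 power
noise_dist = np.array([word_counts[w] for w in vocab], dtype=np.float64) ** 0.75
noise_dist /= noise_dist.sum()

# SGNS parameters: W holds center-word vectors, W_tilde holds context-word vectors.
# The hyperparameters below are starting points and should be tuned for the corpus.
embedding_size = 100
num_negatives = 5
learning_rate = 0.025
epochs = 5
batch_size = 256
rng = np.random.default_rng(123)
W = rng.normal(size=(vocab_size, embedding_size)) / np.sqrt(embedding_size)
W_tilde = rng.normal(size=(vocab_size, embedding_size)) / np.sqrt(embedding_size)

def sigmoid(z):
    # Clip the input to avoid overflow in exp and log(0) in the loss
    return 1.0 / (1.0 + np.exp(-np.clip(z, -10, 10)))

for epoch in range(epochs):
    rng.shuffle(pairs)  # shuffle the training pairs, not the corpus itself
    losses = []
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        centers, contexts = batch[:, 0], batch[:, 1]
        # Simplification: a sampled negative may occasionally equal the true context word
        negatives = rng.choice(vocab_size, size=(len(batch), num_negatives), p=noise_dist)

        v = W[centers]              # (B, D) center vectors
        u_pos = W_tilde[contexts]   # (B, D) true context vectors
        u_neg = W_tilde[negatives]  # (B, N, D) negative-sample vectors

        pos_score = sigmoid(np.sum(v * u_pos, axis=1))          # (B,)
        neg_score = sigmoid(np.einsum('bd,bnd->bn', v, u_neg))  # (B, N)

        # Loss per pair: -log sigma(u_pos . v) - sum_k log sigma(-u_neg_k . v)
        loss = -np.log(pos_score) - np.sum(np.log(1.0 - neg_score), axis=1)
        losses.append(loss.mean())

        # Gradients of the loss with respect to each vector
        g_pos = (pos_score - 1.0)[:, None]  # (B, 1)
        g_neg = neg_score[:, :, None]       # (B, N, 1)
        grad_v = g_pos * u_pos + np.sum(g_neg * u_neg, axis=1)
        grad_u_pos = g_pos * v
        grad_u_neg = g_neg * v[:, None, :]

        # np.add.at accumulates updates for indices repeated within a batch
        np.add.at(W, centers, -learning_rate * grad_v)
        np.add.at(W_tilde, contexts, -learning_rate * grad_u_pos)
        np.add.at(W_tilde, negatives, -learning_rate * grad_u_neg)
    print('Epoch: %d, Loss: %.4f' % (epoch + 1, np.mean(losses)))

# Final word vectors: sum of the center and context embeddings
vec_sgns = W + W_tilde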
```
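To make the negative-sampling update easier to check in isolation, here is a minimal sketch of one SGD step on a single (center, context) pair. The function name `sgns_pair_step`, the toy dimensions, and the learning rate are illustrative assumptions and are not part of the original answer; the loss is the standard SGNS objective, -log σ(u_pos·v) - Σ log σ(-u_neg·v).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_pair_step(v, u_pos, u_neg, lr=0.05):
    """One SGD step on a single (center, context) pair with negative samples.

    v: center vector (D,); u_pos: true context vector (D,); u_neg: negatives (N, D).
    Returns the loss before the update and updated copies of the three arrays.
    (Hypothetical helper for illustration only.)
    """
    pos = sigmoid(u_pos @ v)   # probability assigned to the true context
    neg = sigmoid(u_neg @ v)   # probabilities assigned to the negatives, shape (N,)
    loss = -np.log(pos) - np.sum(np.log(1.0 - neg))
    # Gradients of the loss with respect to each vector
    grad_v = (pos - 1.0) * u_pos + neg @ u_neg
    grad_u_pos = (pos - 1.0) * v
    grad_u_neg = np.outer(neg, v)
    return loss, v - lr * grad_v, u_pos - lr * grad_u_pos, u_neg - lr * grad_u_neg

# Toy data: an 8-dimensional center vector, one true context, five negatives
rng = np.random.default_rng(0)
v = rng.normal(size=8) * 0.1
u_pos = rng.normal(size=8) * 0.1
u_neg = rng.normal(size=(5, 8)) * 0.1

loss_before, v1, u_pos1, u_neg1 = sgns_pair_step(v, u_pos, u_neg)
loss_after, _, _, _ = sgns_pair_step(v1, u_pos1, u_neg1)
print(loss_before, loss_after)  # the loss should shrink after one small step
```

Repeating this step over every pair inside the K=2 window, with negatives drawn from a noise distribution, is what a full SGNS training loop does.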
Next, the following code computes the cosine similarity sim_sgns between the two words on each line of wordsim353_agreed.txt:
```python
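import numpy as np

# Read wordsim353_agreed.txt (assumed to be tab-separated)
word_pairs = []
with open('wordsim353_agreed.txt', 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        # Skip blank lines and comment/header lines
        if not line or line.startswith('#'):
            continue
        # Take the last three tab-separated fields so that an optional
        # leading label column is tolerated
        w1, w2, score = line.split('\t')[-3:]
        word_pairs.append((w1, w2, float(score)))

# Compute the cosine similarity for each word pair
sim_sgns = []
for w1, w2, score in word_pairs:
    if w1 not in word2idx or w2 not in word2idx:
        # At least one word has no vector (it never appeared in the corpus)
        sim_sgns.append(0.0)
    else:
        vec1 = vec_sgns[word2idx[w1]]
        vec2 = vec_sgns[word2idx[w2]]
        sim_sgns.append(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))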
```
In the end, sim_sgns holds the cosine similarity of the two words on each line of wordsim353_agreed.txt. If either word has no vector, sim_sgns for the two words on that line is set to 0.
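The zero-fallback rule can also be wrapped in a small function and checked on toy data. The sketch below is only an illustration: the helper name `cosine_or_zero`, the toy vocabulary, and the extra guard against zero-length vectors are assumptions that do not appear in the original answer.

```python
import numpy as np

def cosine_or_zero(w1, w2, vectors, index):
    """Cosine similarity of two words, or 0.0 if either word lacks a vector.

    vectors: array of shape (vocab_size, D); index: dict mapping word -> row.
    (Hypothetical helper for illustration only.)
    """
    if w1 not in index or w2 not in index:
        return 0.0
    a, b = vectors[index[w1]], vectors[index[w2]]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Guard against all-zero vectors, which would otherwise divide by zero
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy vocabulary with 2-dimensional vectors
toy_index = {"cat": 0, "dog": 1, "car": 2}
toy_vectors = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])

print(cosine_or_zero("cat", "dog", toy_vectors, toy_index))    # about 0.7071
print(cosine_or_zero("cat", "car", toy_vectors, toy_index))    # 0.0 (orthogonal vectors)
print(cosine_or_zero("cat", "tiger", toy_vectors, toy_index))  # 0.0 ("tiger" has no vector)
```

With the real data, the same call would take vec_sgns and word2idx in place of the toy arrays.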