Implementing word2vec with a Python class
Date: 2023-10-21 18:18:51
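Word2Vec is a technique for turning the words in a text into numeric vectors. It trains a shallow neural network on which words appear near each other, so that related words end up with related vectors.
Below is an example of a Word2Vec-style model written as a Python class: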
```python
import numpy as np
from collections import Counter


class Word2Vec:
    """Simplified word2vec-style model trained on (center, context) word pairs."""

    def __init__(self, window_size=2, vector_size=100, learning_rate=0.01, epochs=100):
        self.window_size = window_size
        self.vector_size = vector_size
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.vocabulary = []
        self.word_counts = {}
        self.word_index = {}
        self.index_word = {}
        self.word_vectors = None  # set by train(): array of shape (vocab_size, vector_size)

    def build_vocabulary(self, sentences):
        # Tokenize on whitespace and count how often each word occurs.
        words = []
        for sentence in sentences:
            words += sentence.split()
        word_counts = Counter(words)
        vocabulary = list(word_counts.keys())
        self.word_counts = word_counts
        self.vocabulary = vocabulary
        # Two-way mapping between words and row indices of the embedding matrix.
        self.word_index = {w: i for i, w in enumerate(vocabulary)}
        self.index_word = {i: w for i, w in enumerate(vocabulary)}

    def train(self, sentences):
        self.build_vocabulary(sentences)
        vocab_size = len(self.vocabulary)
        # One randomly initialized row per vocabulary word.
        word_vectors = np.random.uniform(-1, 1, (vocab_size, self.vector_size))
        for epoch in range(self.epochs):
            for sentence in sentences:
                sentence_words = sentence.split()
                sentence_length = len(sentence_words)
                for i, word in enumerate(sentence_words):
                    word_index = self.word_index[word]
                    # Visit every position within window_size of position i.
                    start = max(0, i - self.window_size)
                    end = min(sentence_length, i + self.window_size + 1)
                    for j in range(start, end):
                        if j == i:
                            continue
                        context_index = self.word_index[sentence_words[j]]
                        # Copy both rows so each update uses the pre-update values.
                        center_vector = word_vectors[word_index].copy()
                        context_vector = word_vectors[context_index].copy()
                        # Push the dot product of the pair toward 1.
                        score = np.dot(center_vector, context_vector)
                        step = (1 - score) * self.learning_rate
                        word_vectors[word_index] += step * context_vector
                        word_vectors[context_index] += step * center_vector
        self.word_vectors = word_vectors

    def most_similar(self, word, k=10):
        if word not in self.word_index:
            return None
        word_vector = self.word_vectors[self.word_index[word]]
        word_similarities = {}
        for i, other in enumerate(self.vocabulary):
            if other != word:
                other_vector = self.word_vectors[i]
                # Cosine similarity between the two embeddings.
                similarity = np.dot(other_vector, word_vector) / (
                    np.linalg.norm(other_vector) * np.linalg.norm(word_vector)
                )
                word_similarities[other] = similarity
        return sorted(word_similarities.items(), key=lambda x: x[1], reverse=True)[:k]
```
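The constructor takes the window size, vector size, learning rate, and number of epochs. The class also provides methods for building the vocabulary, training the model, and finding the most similar words.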
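To make the role of the window size concrete, here is a minimal sketch of how a window selects (center, context) pairs from one sentence. The helper name `context_pairs` is hypothetical and not part of the class; it only mirrors the index arithmetic used in the training loop.

```python
def context_pairs(sentence, window_size=2):
    """Return (center, context) word pairs for one whitespace-tokenized sentence."""
    words = sentence.split()
    pairs = []
    for i, center in enumerate(words):
        # Clamp the window to the sentence boundaries.
        start = max(0, i - window_size)
        end = min(len(words), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, words[j]))
    return pairs


pairs = context_pairs("the quick brown fox", window_size=1)
print(pairs)
```

With `window_size=1`, each word is paired only with its immediate neighbors, so the four-word sentence yields six pairs. A larger window produces more pairs per word and ties together words that sit further apart.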
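When building the vocabulary, the class uses `Counter` to count how often each word occurs and stores the counts (a `Counter` is a dictionary subclass). It then builds a list of the distinct words and assigns each word an index, keeping both a word-to-index and an index-to-word mapping.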
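The vocabulary step can be tried on its own. The sketch below repeats the same idea outside the class on a three-sentence toy corpus; it assumes whitespace tokenization and Python 3.7+, where `Counter` keeps keys in first-seen order.

```python
from collections import Counter

sentences = ["hello world", "world goodbye", "goodbye moon"]

# Flatten the corpus into one list of tokens.
words = [w for s in sentences for w in s.split()]
word_counts = Counter(words)

# Distinct words in first-seen order, then the two lookup tables.
vocabulary = list(word_counts)
word_index = {w: i for i, w in enumerate(vocabulary)}
index_word = {i: w for w, i in word_index.items()}

print(word_counts["world"])  # 2
print(word_index)  # {'hello': 0, 'world': 1, 'goodbye': 2, 'moon': 3}
```

The index assigned to a word is the row that holds its vector in the embedding matrix, which is why both directions of the mapping are kept.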
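During training, the class gives every word a randomly initialized vector. It then walks through each sentence in the corpus and pairs every word with each context word inside its window. For each pair it computes the dot product of the two vectors and adds to each vector a multiple of the other, scaled by `(1 - dot_product) * learning_rate`. This is a gradient-descent step on the squared error between the dot product and 1, so co-occurring words are pushed toward a dot product of 1. Note that this is a simplified stand-in for the word2vec objective: standard skip-gram training uses a softmax or negative-sampling loss with separate input and output embeddings, whereas this class uses a single embedding matrix and only positive (co-occurring) pairs.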
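The effect of the update rule is easiest to see on a single pair of vectors. The following is a minimal sketch, not the class's exact loop: the helper `update_pair` is hypothetical, it computes both new vectors from the pre-update values, and the small dimension and fixed seed are assumptions chosen only to keep the demo reproducible.

```python
import numpy as np


def update_pair(center, context, learning_rate=0.01):
    """One gradient-descent step on 0.5 * (1 - center . context) ** 2."""
    score = np.dot(center, context)
    step = (1.0 - score) * learning_rate
    # Both results are computed from the vectors as they were before the step.
    return center + step * context, context + step * center


rng = np.random.default_rng(0)
center = rng.uniform(-1, 1, 8)
context = rng.uniform(-1, 1, 8)

before = float(np.dot(center, context))
for _ in range(2000):
    center, context = update_pair(center, context)
after = float(np.dot(center, context))

print(before, after)
```

After enough steps the dot product settles at 1. Because nothing in this rule pushes unrelated words apart, it should be read as a teaching simplification and not as a replacement for negative sampling.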
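To find similar words, `most_similar` computes the cosine similarity between the query word's vector and the vector of every other word, then returns the k highest-scoring words together with their scores. If the query word is not in the vocabulary, it returns `None`.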
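Cosine similarity is the dot product of two vectors divided by the product of their lengths. As an illustration, the sketch below computes it for all rows of a matrix at once instead of looping; the function name `cosine_similarities` is hypothetical, and it assumes no vector has zero length.

```python
import numpy as np


def cosine_similarities(matrix, query):
    """Cosine similarity between `query` and every row of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return matrix @ query / norms


vectors = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 3.0], [-1.0, 0.0]])
sims = cosine_similarities(vectors, np.array([1.0, 0.0]))

# Indices of the two most similar rows, best first.
top_k = np.argsort(-sims)[:2]
print(sims)
print(top_k)
```

A score of 1 means the vectors point the same way, 0 means they are orthogonal, and -1 means they point in opposite directions. Vector length does not affect the score.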
Usage example:
```python
sentences = ['hello world', 'world goodbye', 'goodbye moon']
w2v = Word2Vec()
w2v.train(sentences)
print(w2v.most_similar('hello'))
```
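Sample output (the vectors are randomly initialized and no seed is set, so the scores and the ranking differ from run to run; treat these numbers as illustrative):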
```
[('world', 0.9999758441566681), ('goodbye', 0.999614138931111), ('moon', 0.9993768610338482)]
```
In this run, 'world' is ranked as the word most similar to 'hello'.