CBOW model code implementation
Time: 2023-11-16 12:53:10
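CBOW (Continuous Bag-of-Words) is a neural network model used in natural language processing. Its input is the words surrounding a position in the text (the context), and its output is the word at that position (the center word). A Keras implementation of the CBOW model follows: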
```python
# Note: written against the Keras 2.x-style API (Tokenizer, input_length, keras.backend)
import numpy as np
from keras import backend as K
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda

# Set hyperparameters
vocab_size = 5000
embedding_dim = 100
window_size = 2

# Build the corpus
corpus = [
    'I like playing soccer with my friends',
    'He enjoys playing basketball on weekends',
    'The cat loves drinking milk every morning'
]

# Convert the corpus into sequences of integer word indices
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(corpus)
sequences = tokenizer.texts_to_sequences(corpus)

# Build the training set: each sample is (context words, center word)
x_train = []
y_train = []
for sequence in sequences:
    # Only positions with a full window on both sides are used
    for i in range(window_size, len(sequence) - window_size):
        context = sequence[i-window_size:i] + sequence[i+1:i+window_size+1]
        target = sequence[i]
        x_train.append(context)
        y_train.append(target)
x_train = np.array(x_train)
y_train = to_categorical(y_train, num_classes=vocab_size)

# Build the CBOW model
model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=window_size * 2))
# Average the context embeddings; a backend op is needed here because
# np.mean cannot operate on symbolic Keras tensors
model.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(embedding_dim,)))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Train the CBOW model
model.fit(x_train, y_train, epochs=50, verbose=1)

# Print the word-vector matrix (the Embedding layer's weights)
embeddings = model.get_weights()[0]
print(embeddings)
```
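In the code above, we first define the corpus, convert it into sequences of integer word indices with the Tokenizer, and build the training set according to the window size. We then build the CBOW model from an Embedding layer, a Lambda layer that averages the context embeddings, and a softmax Dense layer. Finally, we print the word-vector matrix learned during training, which is the weight matrix of the Embedding layer with shape (vocab_size, embedding_dim).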
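To make the two core steps more concrete, here is a minimal sketch in plain Python, without Keras, of how the sliding window produces (context, center word) pairs and of what the forward pass computes: average the context embeddings, then apply a softmax over the vocabulary. The function names `build_cbow_pairs` and `cbow_forward` and all toy numbers are hypothetical and chosen only for illustration. The bias term of the Dense layer is left out for brevity.

```python
import math

def build_cbow_pairs(sequence, window_size):
    """Return (context, target) pairs for every position with a full window."""
    pairs = []
    for i in range(window_size, len(sequence) - window_size):
        context = sequence[i - window_size:i] + sequence[i + 1:i + window_size + 1]
        pairs.append((context, sequence[i]))
    return pairs

def cbow_forward(context, embeddings, output_weights):
    """Average the context embeddings, then apply a softmax over the vocabulary."""
    dim = len(embeddings[0])
    # Mean of the context word vectors (the role of the Lambda layer).
    hidden = [sum(embeddings[w][d] for w in context) / len(context) for d in range(dim)]
    # One score per vocabulary index (the role of the Dense layer, bias omitted).
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in output_weights]
    # Numerically stable softmax.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy data: indices 1..7 stand for a seven-word sentence.
sequence = [1, 2, 3, 4, 5, 6, 7]
pairs = build_cbow_pairs(sequence, window_size=2)

# Hypothetical 2-dimensional embeddings for indices 0..7 (index 0 is unused).
embeddings = [[0.0, 0.0]] + [[0.1 * i, -0.1 * i] for i in range(1, 8)]
# Hypothetical output weights: one row of length 2 per vocabulary index.
output_weights = [[0.01 * i, 0.02 * i] for i in range(8)]

probs = cbow_forward(pairs[0][0], embeddings, output_weights)
```

With a window size of 2, the seven-index sequence yields three pairs, the first being `([1, 2, 4, 5], 3)`. The first and last two positions are skipped because they lack a full window. `probs` holds one probability per vocabulary index and sums to 1. Training adjusts the embeddings so that the probability at the true center word's index increases.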