Counting word frequencies in text with word2vec: code example
Posted: 2024-01-11 15:03:08
The Python code below loads pre-trained word2vec vectors, trains a model on the IMDB dataset, and counts word frequencies:
```python
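import collections

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
from tensorflow.keras.datasets import imdb
from gensim.models import Word2Vec

# Hyperparameters
max_features = 20000  # keep only the most frequent words
maxlen = 80           # pad or truncate every review to this many tokens
batch_size = 32

# Load the IMDB dataset; each review is a list of integer word indices.
# num_words caps the indices so they fit inside the embedding matrix.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Keep the unpadded reviews for the frequency count at the end
raw_sequences = list(x_train) + list(x_test)

# Build the dictionary. load_data() shifts every index by 3 and reserves
# 0 (padding), 1 (start of sequence) and 2 (out of vocabulary).
index_offset = 3
word_index = imdb.get_word_index()
index_to_word = {index + index_offset: word for word, index in word_index.items()}

# Load a pre-trained Word2Vec model (the file must already exist)
word_model = Word2Vec.load("word2vec.model")
word_vectors = word_model.wv
embedding_dim = word_vectors.vector_size

# Fill the embedding matrix; words missing from Word2Vec keep a zero vector
nb_words = min(max_features, len(word_index) + index_offset + 1)
embedding_matrix = np.zeros((nb_words, embedding_dim))
for word, index in word_index.items():
    row = index + index_offset
    if row < nb_words and word in word_vectors:
        embedding_matrix[row] = word_vectors[word]

# Pad the text sequences to the same length
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

# Build the bidirectional LSTM model, initializing the Embedding layer
# with the Word2Vec vectors
model = Sequential()
model.add(Embedding(nb_words, embedding_dim,
                    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix)))
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(Dropout(0.5))
model.add(Bidirectional(LSTM(64)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, batch_size=batch_size, epochs=10,
          validation_data=(x_test, y_test))

# Count word frequencies on the unpadded reviews, so that padding zeros
# and truncation do not distort the counts
counter = collections.Counter()
for seq in raw_sequences:
    for index in seq:
        word = index_to_word.get(index)  # None for the reserved indices
        if word is not None:
            counter[word] += 1

# Print the 10 most frequent words
print(counter.most_common(10))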
```
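This code first builds an embedding matrix from a pre-trained Word2Vec model, then uses that matrix to initialize the Embedding layer of a bidirectional LSTM model and trains it on the IMDB dataset. After training, it counts occurrences across the training and test sets with the collections.Counter class and prints the 10 most frequent entries. Note that the frequency count comes entirely from Counter; it does not depend on the Word2Vec vectors or on the trained model.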
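Since the counting step is independent of the model, it can be tried on its own. The following is a minimal sketch using only the standard library. The sample `documents`, the regex-based `tokenize` helper, and the `count_words` function are assumptions made for illustration and are not part of the original answer.

```python
import re
from collections import Counter

# Hypothetical sample corpus; replace with your own documents.
documents = [
    "The movie was great, and the acting was great too.",
    "The plot was thin, but the music was lovely.",
]


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def count_words(texts):
    """Return a Counter mapping each token to its number of occurrences."""
    counter = Counter()
    for text in texts:
        counter.update(tokenize(text))
    return counter


word_counts = count_words(documents)
print(word_counts.most_common(3))
```

A simple regular expression is enough for a quick count on English text. For other languages or for punctuation-heavy input, a dedicated tokenizer would likely give cleaner results.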