Help me write code for a text classification task that uses word2vec to extract features
Posted: 2023-08-28 20:25:51
Sure, here is a Python code example for a text classification task that uses word2vec to extract features:
```python
import numpy as np
import pandas as pd
import gensim
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import LSTM, Embedding
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
# Load the dataset
data = pd.read_csv('dataset.csv')
# Split the dataset into training and testing sets
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
# Load the pre-trained Word2Vec model
model = gensim.models.KeyedVectors.load_word2vec_format('word2vec.bin', binary=True)
# Tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data['text'])
# Convert the text into sequences of word indices
train_sequences = tokenizer.texts_to_sequences(train_data['text'])
test_sequences = tokenizer.texts_to_sequences(test_data['text'])
# Pad the sequences to a fixed length
max_seq_length = 100
train_sequences = pad_sequences(train_sequences, maxlen=max_seq_length)
test_sequences = pad_sequences(test_sequences, maxlen=max_seq_length)
# Create the embedding matrix
embedding_dim = 300
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    if word in model:
        embedding_matrix[i] = model[word]
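# Convert the labels into one-hot encoded vectors (assumes non-negative integer labels).
# num_classes is fixed from the full dataset so both splits get the same width.
num_classes = int(data['label'].max()) + 1
train_labels = to_categorical(train_data['label'], num_classes=num_classes)
test_labels = to_categorical(test_data['label'], num_classes=num_classes)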
# Define the model architecture
model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], input_length=max_seq_length, trainable=False))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(train_labels.shape[1], activation='softmax'))
# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Train the model
model.fit(train_sequences, train_labels, batch_size=128, epochs=10, validation_data=(test_sequences, test_labels))
```
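This code assumes you already have a preprocessed dataset, `dataset.csv`, with a column named `text` and a column named `label` holding integer class ids. You also need a pre-trained Word2Vec model saved in the binary word2vec format (`word2vec.bin`) with 300-dimensional vectors, matching `embedding_dim`; such a model can be trained with the Gensim library. The pre-trained vectors initialize a frozen `Embedding` layer, and a simple LSTM model on top of it classifies the text.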