Complete code for sentiment classification of the Weibo text dataset weibo_senti_900.csv using an LSTM model
Sure, here is the complete code for sentiment classification of Weibo text using an LSTM model:
```python
import pandas as pd
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, GlobalMaxPooling1D
from keras.layers import LSTM, Embedding
from keras.models import Model
from sklearn.model_selection import train_test_split
# Load the dataset
df = pd.read_csv('weibo_senti_900.csv')
# Convert the text into integer sequences
MAX_SEQUENCE_LENGTH = 100
MAX_NUM_WORDS = 20000
EMBEDDING_DIM = 100
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(df['review'])
sequences = tokenizer.texts_to_sequences(df['review'])
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = np.asarray(df['label'])
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)
# Build the model: Embedding -> LSTM -> global max pooling -> dense classifier
embedding_layer = Embedding(MAX_NUM_WORDS, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH)
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = LSTM(128, return_sequences=True)(embedded_sequences)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(1, activation='sigmoid')(x)
model = Model(sequence_input, preds)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Train the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=128)
# Evaluate the model on the held-out test set
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print('Test accuracy:', accuracy)
```
Note that this code builds the LSTM model with Keras, so the Keras library must be installed (with TensorFlow 2.x, the same layers are also available under `tensorflow.keras`). In addition, the data file weibo_senti_900.csv must be located in the same directory as the script, and it is expected to contain a `review` column with the post text and a `label` column with the sentiment label.
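Once training has finished, the same tokenizer and padding settings can be reused to classify new posts. Below is a minimal inference sketch under that assumption; the example sentences and the 0.5 decision threshold are illustrative and not part of the original code.
```python
# Minimal inference sketch (assumes `tokenizer`, `model`, and
# MAX_SEQUENCE_LENGTH from the training script above are in scope).
new_texts = ['今天心情真好！', '这也太让人失望了']  # hypothetical example posts
new_sequences = tokenizer.texts_to_sequences(new_texts)
new_data = pad_sequences(new_sequences, maxlen=MAX_SEQUENCE_LENGTH)

# The sigmoid output is the predicted probability of the positive class;
# 0.5 is used here as an illustrative decision threshold.
probs = model.predict(new_data)
for text, p in zip(new_texts, probs[:, 0]):
    label = 1 if p >= 0.5 else 0
    print(f'{text} -> prob={p:.3f}, predicted label={label}')
```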