Complete code for Weibo text sentiment prediction with an LSTM model
Date: 2024-03-05 14:48:26
LSTM-based text sentiment analysis
The following is complete code for Weibo text sentiment prediction, implemented in Python with the Keras framework:
```python
import numpy as np
import pandas as pd
import jieba
import keras
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# Load the Weibo text data (expects 'text' and 'sentiment' columns)
df = pd.read_csv('weibo_sentiment.csv')
# Preprocess the posts: segment with jieba and drop stop words
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = {line.strip() for line in f}  # a set gives fast membership tests
def preprocess(text):
    words = jieba.cut(text)
    return [word for word in words if word not in stopwords]
df['text'] = df['text'].apply(preprocess)
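# Build the vocabulary and convert each post to a sequence of word indices
# (Tokenizer is a legacy API; this code assumes Keras 2.x)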
vocab_size = 5000
tokenizer = keras.preprocessing.text.Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(df['text'])
sequences = tokenizer.texts_to_sequences(df['text'])
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = keras.preprocessing.sequence.pad_sequences(sequences, maxlen=100)
# Encode the labels as integers
encoder = LabelEncoder()
labels = encoder.fit_transform(df['sentiment'])
print('Classes:', np.unique(labels))
print('Number of unique words:', len(np.unique(np.hstack(data))))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)
# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)
# Count samples per encoded class (index i holds the count for class i)
print('Class counts in training set:', np.bincount(y_train))
print('Class counts in test set:', np.bincount(y_test))
# Build the LSTM model (single sigmoid output, so the labels must be binary 0/1)
embedding_size = 32
model = Sequential()
model.add(Embedding(vocab_size, embedding_size, input_length=100))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()  # prints the summary itself; wrapping it in print() adds a stray "None"
# Train the model (note: the test set is also used as validation data here)
batch_size = 64
epochs = 5
model.fit(x_train, y_train, validation_data=(x_test, y_test), batch_size=batch_size, epochs=epochs)
# Evaluate the model on the test set
scores = model.evaluate(x_test, y_test, verbose=0)
print('Test accuracy:', scores[1])
```
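To make the vocabulary and padding step in the code above easier to follow, here is a minimal pure-Python sketch of what it does, assuming the Keras defaults: words are indexed by descending frequency starting at 1, only indices below `num_words` are kept, and sequences are truncated and zero-padded on the left. The helper names (`build_word_index`, `to_sequences`, `pad_pre`) and the English tokens are illustrative stand-ins, not part of Keras or of the original code.

```python
from collections import Counter

def build_word_index(token_lists):
    """Rank tokens by frequency: 1 = most frequent; 0 is reserved for padding."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # most_common() sorts by descending count; ties keep first-seen order
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(token_lists, word_index, num_words):
    """Replace tokens with their index, dropping any index >= num_words."""
    return [
        [word_index[t] for t in tokens
         if t in word_index and word_index[t] < num_words]
        for tokens in token_lists
    ]

def pad_pre(seq, maxlen):
    """Keep the last maxlen items, then left-pad with zeros (maxlen must be > 0)."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

# Hypothetical pre-segmented posts standing in for jieba output
docs = [["today", "weather", "good"], ["weather", "bad"], ["today", "mood", "good"]]
word_index = build_word_index(docs)
sequences = to_sequences(docs, word_index, num_words=5)
padded = [pad_pre(s, maxlen=4) for s in sequences]
print(padded)
```

With `num_words=5`, the least frequent token ("mood", index 5) is dropped, which mirrors how `vocab_size = 5000` limits the vocabulary in the full script.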
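Note: Replace the dataset and the stop-word list used in the code with your own. Training can take a while; adjust `epochs`, `batch_size`, or the model size to fit your data and hardware.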
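As a rough guide to the file layout the script reads, the sketch below writes a tiny placeholder `weibo_sentiment.csv` (with the `text` and `sentiment` columns the code accesses) and a `stopwords.txt` with one word per line. The rows, the label strings `positive`/`negative`, the stop words, and the helper name `write_sample_inputs` are all made up for illustration; they only show the expected format and are far too small to train on.

```python
import csv
import tempfile
from pathlib import Path

def write_sample_inputs(directory):
    """Write a placeholder dataset and stop-word list in the expected layout."""
    directory = Path(directory)
    csv_path = directory / "weibo_sentiment.csv"
    stop_path = directory / "stopwords.txt"
    # Made-up example rows: (post text, sentiment label)
    rows = [
        ("今天天气真好,心情很愉快", "positive"),
        ("这部电影太让人失望了", "negative"),
    ]
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "sentiment"])  # column names read by the script
        writer.writerows(rows)
    # One stop word per line, UTF-8 encoded
    stop_path.write_text("的\n了\n", encoding="utf-8")
    return csv_path, stop_path

sample_dir = tempfile.mkdtemp()
csv_path, stop_path = write_sample_inputs(sample_dir)
with open(csv_path, encoding="utf-8", newline="") as f:
    header = next(csv.reader(f))
print(header)
```

Because the model ends in a single sigmoid unit, a real dataset for this script should contain exactly two sentiment classes.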