Sentiment analysis with an RNN in Python
Date: 2023-09-08 21:08:14
1. Import the required libraries
```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
```
2. Load the dataset
```python
df = pd.read_csv('data.csv')
```
3. Preprocess the data
```python
# Drop the columns that are not needed
df.drop(columns=['id', 'date', 'query', 'user'], inplace=True)
# Rename the remaining columns
df.columns = ['sentiment', 'text']
# Raw labels: 0 = negative, 4 = positive.
# Map them straight to binary targets (0 = negative, 1 = positive)
# so the column keeps an integer dtype.
df['sentiment'] = df['sentiment'].replace({4: 1})
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['sentiment'], test_size=0.2, random_state=42)
# Create a tokenizer that converts text into sequences of integer ids;
# fit it on the training text only
tokenizer = Tokenizer(num_words=10000, oov_token='<OOV>')
tokenizer.fit_on_texts(X_train)
# Convert the training and test text into integer sequences
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)
# Pad (or truncate) every sequence to the same length
max_len = 50
X_train_seq = pad_sequences(X_train_seq, maxlen=max_len, padding='post', truncating='post')
X_test_seq = pad_sequences(X_test_seq, maxlen=max_len, padding='post', truncating='post')
# Print the shapes of the training and test sets
print(X_train_seq.shape, y_train.shape)
print(X_test_seq.shape, y_test.shape)
```
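To make the tokenizing and padding step less of a black box, here is a minimal pure-Python sketch of the idea. It is a simplified illustration, not the Keras implementation: the helper names `build_vocab`, `texts_to_sequences` and `pad_post` are hypothetical, and it splits on whitespace only, whereas the Keras `Tokenizer` also filters punctuation. It assumes id 0 is reserved for padding and id 1 for the OOV token.

```python
from collections import Counter


def build_vocab(texts, num_words, oov_token="<OOV>"):
    """Give the most frequent words the smallest integer ids.

    Id 0 is reserved for padding and id 1 for the OOV token, so only
    num_words - 2 real words receive an id of their own.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {oov_token: 1}
    for word, _ in counts.most_common(num_words - 2):
        vocab[word] = len(vocab) + 1
    return vocab


def texts_to_sequences(texts, vocab):
    """Replace each word with its id; unknown words map to the OOV id (1)."""
    return [[vocab.get(word, 1) for word in text.lower().split()] for text in texts]


def pad_post(sequences, maxlen, value=0):
    """Truncate at the end and pad at the end (padding='post', truncating='post')."""
    return [seq[:maxlen] + [value] * (maxlen - len(seq)) for seq in sequences]


train_texts = ["good movie", "bad movie", "good good plot"]
vocab = build_vocab(train_texts, num_words=4)
sequences = texts_to_sequences(["good movie", "terrible plot twist here"], vocab)
padded = pad_post(sequences, maxlen=3)
print(vocab)   # {'<OOV>': 1, 'good': 2, 'movie': 3}
print(padded)  # [[2, 3, 0], [1, 1, 1]]
```

Words outside the vocabulary ("terrible", "plot", "twist", "here") all collapse to id 1, and the four-word sentence is cut to three ids, which is the same trade-off `num_words` and `max_len` control in the real pipeline.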
4. Build the RNN model
```python
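model = keras.Sequential([
    # mask_zero=True makes the RNN layers skip the zero padding added at the
    # end of each sequence, so the final state is not dominated by padding
    keras.layers.Embedding(input_dim=10000, output_dim=32, input_length=max_len, mask_zero=True),
    # return_sequences=True passes the full sequence of states to the next RNN layer
    keras.layers.SimpleRNN(units=32, return_sequences=True),
    keras.layers.SimpleRNN(units=32),
    # A single sigmoid unit outputs the probability of positive sentiment
    keras.layers.Dense(units=1, activation='sigmoid')
])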
model.summary()
```
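The recurrence behind `SimpleRNN` and the parameter counts reported by `model.summary()` can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not Keras code: `simple_rnn` and `simple_rnn_params` are hypothetical helpers, the weights are toy values, and the matrices are laid out as `units x input_dim`, whereas Keras stores the kernel transposed.

```python
import math


def simple_rnn(inputs, W_x, W_h, b, return_sequences=False):
    """Apply h_t = tanh(W_x x_t + W_h h_{t-1} + b) over a list of input vectors.

    inputs: list of T vectors of length input_dim
    W_x: units x input_dim, W_h: units x units, b: length units
    """
    units = len(b)
    h = [0.0] * units  # initial hidden state
    outputs = []
    for x in inputs:
        h = [
            math.tanh(
                sum(W_x[i][j] * x[j] for j in range(len(x)))
                + sum(W_h[i][k] * h[k] for k in range(units))
                + b[i]
            )
            for i in range(units)
        ]
        outputs.append(h)
    # return_sequences=True yields every state; otherwise only the last one
    return outputs if return_sequences else h


def simple_rnn_params(input_dim, units):
    """Input kernel + recurrent kernel + bias."""
    return units * (input_dim + units + 1)


# Toy weights: 1 input feature, 2 hidden units
W_x = [[1.0], [0.0]]
W_h = [[0.0, 0.0], [1.0, 0.0]]
b = [0.0, 0.0]
steps = [[1.0], [0.0]]
all_states = simple_rnn(steps, W_x, W_h, b, return_sequences=True)
last_state = simple_rnn(steps, W_x, W_h, b)

# Parameter counts for the model defined above
embedding_params = 10000 * 32
rnn_params = simple_rnn_params(input_dim=32, units=32)
dense_params = 32 * 1 + 1
total_params = embedding_params + 2 * rnn_params + dense_params
print(rnn_params, total_params)  # 2080 324193
```

The first `SimpleRNN` layer needs `return_sequences=True` because the second one consumes one vector per time step; the second keeps only its last state, which feeds the `Dense` layer.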
5. Compile and train the model
```python
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(X_train_seq, y_train, validation_split=0.2, epochs=5, batch_size=128)
```
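For reference, the `binary_crossentropy` loss minimized above averages `-(y*log(p) + (1-y)*log(1-p))` over the batch. The sketch below is a plain-Python illustration of that formula, not the Keras implementation; the function name and the clipping constant `eps` are assumptions chosen to keep `log` finite.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0]
uncertain = binary_crossentropy(labels, [0.5, 0.5])  # equals ln(2)
confident = binary_crossentropy(labels, [0.9, 0.1])
wrong = binary_crossentropy(labels, [0.1, 0.9])
print(uncertain, confident, wrong)
```

Confident correct predictions give a loss close to 0, a 50/50 guess gives ln 2, and confident wrong predictions are penalized most heavily, which is what pushes the sigmoid output toward the right label during training.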
6. Evaluate the model
```python
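# Plot training and validation accuracy and loss
# (the validation data comes from validation_split, not from the test set)
fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))
ax_acc.plot(history.history['accuracy'], label='train_acc')
ax_acc.plot(history.history['val_accuracy'], label='val_acc')
ax_acc.set_xlabel('epoch')
ax_acc.legend()
ax_loss.plot(history.history['loss'], label='train_loss')
ax_loss.plot(history.history['val_loss'], label='val_loss')
ax_loss.set_xlabel('epoch')
ax_loss.legend()
plt.show()
# Evaluate the model on the test set
test_loss, test_acc = model.evaluate(X_test_seq, y_test)
print('Test Accuracy:', test_acc)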
```
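The accuracy figure above counts a prediction as positive when the sigmoid output exceeds 0.5. As a rough sketch of what that number summarizes, the hypothetical helpers below compute the same thresholded accuracy plus the four confusion counts from a list of probabilities; the sample values are made up for illustration.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    correct = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return correct / len(y_true)


def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (true_pos, false_pos, true_neg, false_neg)."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = int(p > threshold)
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


sample_labels = [1, 0, 1, 0]
sample_probs = [0.9, 0.2, 0.4, 0.7]  # illustrative values only
acc = binary_accuracy(sample_labels, sample_probs)
counts = confusion_counts(sample_labels, sample_probs)
print(acc, counts)  # 0.5 (1, 1, 1, 1)
```

With real data, the probabilities could come from `model.predict(X_test_seq).ravel()` and the labels from `y_test`; looking at the four counts shows whether errors lean toward false positives or false negatives.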
7. Predict on new text
```python
# Classify the sentiment of a single piece of text
text = "I hate this movie, it's so boring!"
text_seq = tokenizer.texts_to_sequences([text])
text_seq = pad_sequences(text_seq, maxlen=max_len, padding='post', truncating='post')
# predict returns an array of shape (1, 1); take the scalar probability
pred = model.predict(text_seq)[0, 0]
sentiment = 'positive' if pred > 0.5 else 'negative'
print('Text:', text)
print('Sentiment:', sentiment)
```
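When classifying several texts at once, it can help to report how far each probability is from the 0.5 cut-off. The helper below is a hypothetical sketch (the name `label_predictions` and the sample probabilities are assumptions); it only post-processes probabilities and does not call the model itself.

```python
def label_predictions(texts, probs, threshold=0.5):
    """Pair each text with a label and the probability assigned to that label."""
    results = []
    for text, p in zip(texts, probs):
        label = 'positive' if p > threshold else 'negative'
        confidence = p if label == 'positive' else 1.0 - p
        results.append((text, label, confidence))
    return results


sample_texts = ["Great film!", "So boring."]
sample_outputs = [0.9, 0.25]  # illustrative probabilities, not real model output
labeled = label_predictions(sample_texts, sample_outputs)
for text, label, confidence in labeled:
    print(f"{text!r}: {label} ({confidence:.2f})")
```

With the trained model, the probabilities would be obtained by tokenizing and padding the texts as above and passing `model.predict(padded).ravel().tolist()` as `probs`.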