Sentiment polarity analysis of text with a CNN: show me the code; I already have a sentiment dictionary
Time: 2023-06-16 20:08:03 Views: 54
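First, import the required libraries, including `numpy`, `pandas`, and `keras`. The `keras.preprocessing.text` import assumes a Keras 2.x-style installation; newer releases may not provide it. The code is as follows: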
```python
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Embedding, Conv1D, MaxPooling1D, Dropout
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
```
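Next, load the sentiment dictionary and preprocess it: drop stopwords, segment each entry with jieba, and split the entries into positive and negative word lists. The code assumes the spreadsheet has `word`, `positive`, and `negative` columns: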
```python
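import jieba

# Load the sentiment dictionary (assumed columns: word, positive, negative)
sentiment_dict = pd.read_excel('情感词典.xlsx')
# Drop dictionary entries that are stopwords (one stopword per line in stopwords.txt)
stopwords = pd.read_csv('stopwords.txt', header=None, sep='\t')
stopwords = list(stopwords[0])
# .copy() avoids pandas' SettingWithCopyWarning when the column is overwritten below
sentiment_dict = sentiment_dict[~sentiment_dict['word'].isin(stopwords)].copy()
# Segment each entry so it has the same space-joined form as the segmented texts
sentiment_dict['word'] = sentiment_dict['word'].apply(lambda x: ' '.join(jieba.cut(x)))
# Build the positive and negative word lists
pos_dict = sentiment_dict[sentiment_dict['positive'] == 1]['word'].tolist()
neg_dict = sentiment_dict[sentiment_dict['negative'] == 1]['word'].tolist()

# Count how many distinct dictionary entries occur in a segmented text.
# Note: `w in text` is a substring test on the space-joined string.
def count_sentiment_words(text):
    pos_words = [w for w in pos_dict if w in text]
    neg_words = [w for w in neg_dict if w in text]
    pos_count = len(pos_words)
    neg_count = len(neg_words)
    return pos_count, neg_count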
```
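Because `count_sentiment_words` relies on substring tests, a short dictionary entry also matches inside longer words. As a possible alternative, the sketch below counts whole tokens instead. It assumes the text is already segmented and space-joined; the function name `count_sentiment_tokens` and the sample word sets are hypothetical and only serve as illustration.

```python
def count_sentiment_tokens(text, pos_words, neg_words):
    """Count tokens of a space-joined text that appear in each word set."""
    tokens = text.split()
    pos_count = sum(1 for t in tokens if t in pos_words)
    neg_count = sum(1 for t in tokens if t in neg_words)
    return pos_count, neg_count


# Hypothetical sample words, not taken from a real sentiment dictionary
sample_pos = {"好", "喜欢"}
sample_neg = {"差", "讨厌"}
sample_counts = count_sentiment_tokens("这 部 电影 很 好 我 喜欢", sample_pos, sample_neg)
```

Unlike the substring version, this counts every occurrence of a token, and a multi-token dictionary entry would never match, so which variant fits better depends on how the dictionary entries are segmented.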
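Next, load and preprocess the training data. Assume it is a CSV file with two columns: `text` (the text content) and `polarity` (the sentiment label, 0 for negative and 1 for positive). The code is as follows: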
```python
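# Load the training data
data = pd.read_csv('train_data.csv')
# Segment each text with jieba and join the tokens with spaces
data['text'] = data['text'].apply(lambda x: ' '.join(jieba.cut(x)))
# Count sentiment-dictionary hits per text (these columns are not fed to the CNN)
data['pos_count'], data['neg_count'] = zip(*data['text'].apply(count_sentiment_words))
# The labels are already 0/1 as described above, so only cast them to int
# (comparing them with the string 'positive' would turn every label into 0)
data['polarity'] = data['polarity'].astype(int)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['polarity'], test_size=0.2, random_state=42)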
```
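If the label column might contain strings rather than 0/1 values, a small normalization step can make the conversion explicit. The sketch below is one way to do it; the helper name `normalize_polarity` and the accepted strings `'positive'` and `'negative'` are assumptions, since the answer only describes 0/1 labels.

```python
def normalize_polarity(value):
    """Map a raw label to 0 (negative) or 1 (positive); raise on anything else."""
    mapping = {"0": 0, "1": 1, "negative": 0, "positive": 1}
    key = str(value).strip().lower()
    # Numeric labels read from a CSV file may arrive as floats such as 1.0
    if key.endswith(".0"):
        key = key[:-2]
    if key not in mapping:
        raise ValueError(f"unexpected polarity label: {value!r}")
    return mapping[key]


labels = [normalize_polarity(v) for v in [1, 0, "positive", "negative", 1.0]]
```

It could be applied with `data['polarity'] = data['polarity'].apply(normalize_polarity)`, so that an unexpected label raises an error instead of silently becoming 0.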
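Next, encode the texts by mapping each word to an integer index, using the Keras `Tokenizer` class. Two parameters need to be set: the vocabulary size (`num_words`, the maximum number of words to consider) and the maximum length of each text (`max_len`), to which every sequence is padded or truncated. The code is as follows: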
```python
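# Fit the tokenizer on the training texts only, limiting the vocabulary via num_words
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)
# Convert each text into a list of word indices
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)
# Pad or truncate every sequence to exactly max_len indices
max_len = 200
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len)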
```
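To make the encoding step concrete, here is a plain-Python sketch of the idea. It is not the Keras implementation; it only mimics what I understand to be the default behavior (indices assigned by decreasing word frequency starting at 1, index 0 reserved for padding, padding and truncation applied at the front). The helper names and toy texts are hypothetical.

```python
from collections import Counter


def build_word_index(texts):
    """Assign indices 1, 2, ... to words in order of decreasing frequency."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(text, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    ids = (word_index.get(word) for word in text.split())
    return [i for i in ids if i is not None and i < num_words]


def pad(seq, max_len):
    """Left-pad with zeros, or keep only the last max_len indices."""
    seq = seq[-max_len:]
    return [0] * (max_len - len(seq)) + seq


toy_texts = ["我 喜欢 这 部 电影", "我 讨厌 这 部 电影", "我 喜欢"]
toy_index = build_word_index(toy_texts)
toy_encoded = [pad(encode(t, toy_index, num_words=10000), max_len=4) for t in toy_texts]
```

Every encoded text ends up with the same length, which is what allows the texts to be stacked into a single array for the embedding layer.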
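Next, build the CNN model. It consists of an embedding layer, a convolution layer, a pooling layer, a fully connected layer, and an output layer. The code is as follows: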
```python
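from keras.layers import Flatten

# Build the model
model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=128, input_length=max_len))
model.add(Conv1D(filters=64, kernel_size=5, activation='relu'))
model.add(MaxPooling1D(pool_size=4))
model.add(Dropout(0.2))
# Flatten the (steps, filters) feature map so the dense layers emit one value per text;
# without this the output would have one prediction per time step and not match the labels
model.add(Flatten())
model.add(Dense(units=10, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])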
```
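To see how the sequence length changes through the convolution and pooling layers, the arithmetic can be sketched as follows. It assumes the Keras defaults of `padding='valid'` and stride 1 for `Conv1D`, and a pooling stride equal to `pool_size` for `MaxPooling1D`; the helper names are hypothetical.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output steps of a 1D convolution with 'valid' padding."""
    return (input_length - kernel_size) // stride + 1


def pool1d_output_length(input_length, pool_size):
    """Output steps of 1D max pooling with 'valid' padding and stride = pool_size."""
    return (input_length - pool_size) // pool_size + 1


conv_len = conv1d_output_length(input_length=200, kernel_size=5)
pool_len = pool1d_output_length(conv_len, pool_size=4)
flat_size = pool_len * 64  # 64 filters per remaining step
```

With these settings the pooled feature map has 49 steps of 64 channels. A dense layer applied directly to it would act on each step separately, so the feature map needs to be flattened (3136 values) or globally pooled to obtain a single prediction per text.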
Finally, train the model on the training set and evaluate it on the test set. The code is as follows:
```python
# Train the model (the test set doubles as validation data here)
model.fit(X_train_pad, y_train, epochs=10, batch_size=32, validation_data=(X_test_pad, y_test))
# Evaluate the model on the test set
score = model.evaluate(X_test_pad, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
```
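The `pos_count` and `neg_count` columns computed from the sentiment dictionary are never passed to the network. One possible way to use them, sketched here with the Keras functional API, is to concatenate them with the CNN features before the dense layers. The function name `build_two_input_model`, the layer sizes, and the use of `GlobalMaxPooling1D` are assumptions made for illustration and are not part of the original answer.

```python
import numpy as np
from keras.layers import Concatenate, Conv1D, Dense, Embedding, GlobalMaxPooling1D, Input
from keras.models import Model


def build_two_input_model(vocab_size=10000, max_len=200, n_lexicon_features=2):
    # Branch 1: token ids -> embedding -> convolution -> one vector per text
    tokens_in = Input(shape=(max_len,), dtype="int32", name="tokens")
    x = Embedding(input_dim=vocab_size, output_dim=128)(tokens_in)
    x = Conv1D(filters=64, kernel_size=5, activation="relu")(x)
    x = GlobalMaxPooling1D()(x)
    # Branch 2: dictionary-based counts (pos_count, neg_count)
    lexicon_in = Input(shape=(n_lexicon_features,), name="lexicon")
    # Join both branches, then classify
    merged = Concatenate()([x, lexicon_in])
    hidden = Dense(10, activation="relu")(merged)
    output = Dense(1, activation="sigmoid")(hidden)
    model = Model(inputs=[tokens_in, lexicon_in], outputs=output)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


# Tiny smoke run with dummy inputs to check that the shapes line up
two_input_model = build_two_input_model(vocab_size=50, max_len=20)
demo_tokens = np.zeros((3, 20), dtype="int32")
demo_counts = np.zeros((3, 2), dtype="float32")
demo_pred = two_input_model.predict([demo_tokens, demo_counts], verbose=0)
```

To train such a model, the count columns would have to be split with the same row indices as the texts and passed alongside the padded sequences, for example `model.fit([X_train_pad, counts_train], y_train, ...)`, where `counts_train` is a hypothetical array built from `pos_count` and `neg_count`.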
The complete code is as follows:
```python
import numpy as np
import pandas as pd
import jieba
from keras.models import Sequential
from keras.layers import Dense, Embedding, Conv1D, MaxPooling1D, Dropout, Flatten
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# Load the sentiment dictionary (assumed columns: word, positive, negative)
sentiment_dict = pd.read_excel('情感词典.xlsx')
# Drop dictionary entries that are stopwords (one stopword per line in stopwords.txt)
stopwords = pd.read_csv('stopwords.txt', header=None, sep='\t')
stopwords = list(stopwords[0])
# .copy() avoids pandas' SettingWithCopyWarning when the column is overwritten below
sentiment_dict = sentiment_dict[~sentiment_dict['word'].isin(stopwords)].copy()
# Segment each entry so it has the same space-joined form as the segmented texts
sentiment_dict['word'] = sentiment_dict['word'].apply(lambda x: ' '.join(jieba.cut(x)))
# Build the positive and negative word lists
pos_dict = sentiment_dict[sentiment_dict['positive'] == 1]['word'].tolist()
neg_dict = sentiment_dict[sentiment_dict['negative'] == 1]['word'].tolist()

# Count how many distinct dictionary entries occur in a segmented text.
# Note: `w in text` is a substring test on the space-joined string.
def count_sentiment_words(text):
    pos_words = [w for w in pos_dict if w in text]
    neg_words = [w for w in neg_dict if w in text]
    pos_count = len(pos_words)
    neg_count = len(neg_words)
    return pos_count, neg_count

# Load the training data
data = pd.read_csv('train_data.csv')
# Segment each text with jieba and join the tokens with spaces
data['text'] = data['text'].apply(lambda x: ' '.join(jieba.cut(x)))
# Count sentiment-dictionary hits per text (these columns are not fed to the CNN)
data['pos_count'], data['neg_count'] = zip(*data['text'].apply(count_sentiment_words))
# The labels are already 0/1, so only cast them to int
# (comparing them with the string 'positive' would turn every label into 0)
data['polarity'] = data['polarity'].astype(int)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['polarity'], test_size=0.2, random_state=42)

# Fit the tokenizer on the training texts only, limiting the vocabulary via num_words
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)
# Convert each text into a list of word indices
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)
# Pad or truncate every sequence to exactly max_len indices
max_len = 200
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len)

# Build the model
model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=128, input_length=max_len))
model.add(Conv1D(filters=64, kernel_size=5, activation='relu'))
model.add(MaxPooling1D(pool_size=4))
model.add(Dropout(0.2))
# Flatten the (steps, filters) feature map so the dense layers emit one value per text
model.add(Flatten())
model.add(Dense(units=10, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model (the test set doubles as validation data here)
model.fit(X_train_pad, y_train, epochs=10, batch_size=32, validation_data=(X_test_pad, y_test))
# Evaluate the model on the test set
score = model.evaluate(X_test_pad, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
```