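Using the attached data.csv, classify food-delivery reviews into two classes (positive and negative) in two ways: by calling sklearn and by writing a model class yourself. Based on the classification results, plot a positive word cloud and a negative word cloud.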
Posted: 2023-09-09 11:13:37
To classify the food-delivery reviews, we first need to preprocess the data. The steps are as follows:
1. Import the required libraries
```python
import pandas as pd
import jieba
```
2. Load the data
```python
data = pd.read_csv('data.csv', encoding='utf-8')
```
3. Clean the data
Drop unused columns and duplicate rows, and convert the review text to strings.
```python
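# These column names are assumed to exist in data.csv; errors='ignore' skips any that are absent
data.drop(['Unnamed: 0', 'shop_id', 'star', 'time'], axis=1, inplace=True, errors='ignore')
# Drop rows with a missing review or label so astype(str) does not produce the literal string 'nan'
data.dropna(subset=['comment', 'label'], inplace=True)
data.drop_duplicates(inplace=True)
data['comment'] = data['comment'].astype(str)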
```
4. Tokenize
Use jieba to tokenize the reviews and remove stop words.
```python
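# Load stop words, one per line (stopwords.txt is assumed to exist next to the script)
with open('stopwords.txt', encoding='utf-8') as f:
    stopwords = {line.strip() for line in f if line.strip()}

def cut_words(comment):
    """Tokenize with jieba, drop stop words and whitespace, and join tokens with spaces."""
    words = jieba.cut(comment)
    words = [word for word in words if word.strip() and word not in stopwords]
    return ' '.join(words)

data['comment'] = data['comment'].apply(cut_words)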
```
5. Split the dataset
Split the data into a training set and a test set.
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data['comment'], data['label'], test_size=0.2, random_state=42)
```
6. Extract features
Use TF-IDF to turn the tokenized text into feature vectors.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
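# The default token_pattern drops single-character tokens, which would discard Chinese words such as 好 or 差
tfidf = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')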
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)
```
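To make the TF-IDF step less of a black box, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings (smoothed IDF followed by L2 normalization). The function name `tfidf_vectors` and the two sample reviews are made up for illustration; in practice the sklearn vectorizer above is the one to use.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for space-separated documents.

    Uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1
    and L2-normalizes every row.
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(df)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        vectors.append([v / norm for v in row])
    return vocab, vectors

# Hypothetical, already tokenized reviews
vocab, vectors = tfidf_vectors(["味道 好 送餐 快", "味道 差 送餐 慢"])
```

Terms that occur in every review (味道, 送餐) end up with a lower weight than terms unique to one review (好, 差), which is what makes TF-IDF useful for telling the classes apart.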
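Now we can start classifying. Below I cover both approaches: calling sklearn and writing the model class ourselves.
### Using sklearn
sklearn offers many classification algorithms. Here I use naive Bayes.
1. Train the model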
```python
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, y_train)
```
2. Predict on the test set
```python
y_pred = clf.predict(X_test)
```
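For readers who want to see what a multinomial naive Bayes classifier does internally, the following is a small sketch written without sklearn. It works on raw token counts rather than TF-IDF weights, and the class name `SimpleMultinomialNB` and the four training reviews are invented for the example, so treat it as an illustration rather than a replacement for `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

class SimpleMultinomialNB:
    """Multinomial naive Bayes on token counts with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes_ = sorted(set(labels))
        self.class_counts_ = Counter(labels)
        self.token_counts_ = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.token_counts_[label].update(doc.split())
        self.vocab_ = {t for c in self.classes_ for t in self.token_counts_[c]}
        return self

    def _log_score(self, tokens, label):
        total = sum(self.token_counts_[label].values())
        denom = total + self.alpha * len(self.vocab_)
        # Log prior of the class
        score = math.log(self.class_counts_[label] / sum(self.class_counts_.values()))
        for token in tokens:
            if token in self.vocab_:  # ignore tokens never seen in training
                score += math.log((self.token_counts_[label][token] + self.alpha) / denom)
        return score

    def predict(self, docs):
        return [max(self.classes_, key=lambda c: self._log_score(doc.split(), c))
                for doc in docs]

# Hypothetical tokenized reviews: 1 = positive, 0 = negative
train_docs = ["味道 好 送餐 快", "好吃 推荐", "味道 差 送餐 慢", "难吃 差评"]
train_labels = [1, 1, 0, 0]
nb = SimpleMultinomialNB().fit(train_docs, train_labels)
predictions = nb.predict(["送餐 快 好吃", "送餐 慢 难吃"])
```

The first test review shares 快 and 好吃 with the positive training reviews and the second shares 慢 and 难吃 with the negative ones, so the two predictions come out as positive and negative respectively.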
3. Evaluate the model
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1-score:', f1_score(y_test, y_pred))
```
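The four scores above all derive from the counts of true positives, false positives, and false negatives. As a rough sketch of how they relate (the helper name `binary_metrics` and the sample labels are hypothetical):

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels: 2 true positives, 1 false positive, 1 false negative
metrics = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

With these sample labels the accuracy is 3/5 and precision, recall, and F1 are all 2/3.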
4. Generate the word clouds
```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt
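# Group the test-set reviews by predicted label so the clouds reflect the classification results
test_comments = data.loc[y_test.index, 'comment']
pos_words = ' '.join(test_comments[y_pred == 1])
neg_words = ' '.join(test_comments[y_pred == 0])
# WordCloud needs a font with CJK glyphs to render Chinese; 'simhei.ttf' is an assumed path
font = 'simhei.ttf'
pos_wordcloud = WordCloud(font_path=font, background_color='white', width=800, height=600).generate(pos_words)
neg_wordcloud = WordCloud(font_path=font, background_color='white', width=800, height=600).generate(neg_words)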
plt.imshow(pos_wordcloud)
plt.axis('off')
plt.show()
plt.imshow(neg_wordcloud)
plt.axis('off')
plt.show()
```
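A word cloud is driven by per-class word frequencies, so it can help to compute those explicitly and inspect them before plotting. The sketch below counts words grouped by predicted label; the helper `class_word_frequencies` and the sample data are hypothetical.

```python
from collections import Counter

def class_word_frequencies(comments, predictions, target):
    """Count tokens over all space-separated comments predicted as `target`."""
    freqs = Counter()
    for comment, pred in zip(comments, predictions):
        if pred == target:
            freqs.update(comment.split())
    return freqs

# Made-up tokenized reviews and predicted labels (1 = positive, 0 = negative)
comments = ["味道 好 送餐 快", "味道 差", "好吃 味道 好"]
predicted = [1, 0, 1]
pos_freqs = class_word_frequencies(comments, predicted, target=1)
neg_freqs = class_word_frequencies(comments, predicted, target=0)
```

The resulting counters can be passed to `WordCloud(...).generate_from_frequencies(pos_freqs)` instead of calling `generate` on one long string, which also makes it easy to drop words that dominate both classes.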
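### Writing the model class ourselves
We can also write our own model class for the classification. Here I use PyTorch and the torchtext library.
1. Import the required libraries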
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torchtext
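# Field, TabularDataset and BucketIterator belong to the legacy torchtext API: in torchtext 0.9 to 0.11
# they live in torchtext.legacy.data, and they were removed in 0.12
from torchtext.data import Field, TabularDataset, BucketIterator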
```
2. Define the fields
```python
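# jieba replaces the spacy tokenizer here (assumes jieba is imported as in the preprocessing step);
# whitespace-only tokens are dropped
TEXT = Field(tokenize=lambda s: [t for t in jieba.cut(s) if t.strip()])
# unk_token=None keeps the label vocabulary limited to the two classes
LABEL = Field(sequential=False, unk_token=None)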
```
3. Load the data
```python
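# train.csv and test.csv are assumed to hold exactly two columns, comment and label, in that order,
# with a header row (hence skip_header=True)
datafields = [('comment', TEXT), ('label', LABEL)]
trn, tst = TabularDataset.splits(path='.', train='train.csv', test='test.csv', format='csv', skip_header=True, fields=datafields)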
```
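The loading step expects `train.csv` and `test.csv`, which the preprocessing section never writes. One possible way to produce them is sketched below using only the standard library; the function `write_splits`, its parameters, and the 80/20 ratio are assumptions for illustration, not part of the original data.

```python
import csv
import random

def write_splits(rows, train_path="train.csv", test_path="test.csv",
                 test_ratio=0.2, seed=42, header=True):
    """Shuffle (comment, label) rows and write them to two CSV files.

    Returns the number of training and test rows written.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for a reproducible split
    n_test = max(1, int(len(rows) * test_ratio))
    splits = {test_path: rows[:n_test], train_path: rows[n_test:]}
    for path, subset in splits.items():
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            if header:
                writer.writerow(["comment", "label"])
            writer.writerows(subset)
    return len(rows) - n_test, n_test
```

With the DataFrame from the preprocessing steps, a call such as `write_splits(zip(data['comment'], data['label']))` would create both files. If the files are written with a header row, `TabularDataset.splits` needs `skip_header=True`; otherwise pass `header=False` here.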
4. Build the vocabulary
```python
TEXT.build_vocab(trn, max_size=8000)
LABEL.build_vocab(trn)
```
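Conceptually, building a vocabulary means keeping the most frequent tokens, assigning each an integer index, and mapping everything else to an unknown token. The following is a simplified sketch of that idea, not torchtext's actual implementation; `build_vocab`, `numericalize`, and the sample documents are hypothetical.

```python
from collections import Counter

def build_vocab(tokenized_docs, max_size=8000, specials=("<unk>", "<pad>")):
    """Map the `max_size` most frequent tokens to indices after the special tokens."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi, unk="<unk>"):
    """Convert tokens to indices, falling back to the unknown-token index."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]

# Made-up tokenized reviews: 味道 appears 3 times, 好 twice, the rest once
docs = [["味道", "好"], ["味道", "差"], ["味道", "好", "快"]]
itos, stoi = build_vocab(docs, max_size=2)
ids = numericalize(["味道", "很", "好"], stoi)
```

Here only 味道 and 好 make it into the vocabulary, so the unseen token 很 is mapped to the index of `<unk>`.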
5. Define the model class
```python
class SentimentClassifier(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=2, bidirectional=True, dropout=0.5)
        # The classifier sees the forward and backward final states, hence hidden_dim * 2
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(0.5)

    def forward(self, text):
        # text: [seq_len, batch_size]
        embedded = self.dropout(self.embedding(text))
        output, (hidden, cell) = self.rnn(embedded)
        # Concatenate the last layer's forward (hidden[-2]) and backward (hidden[-1]) states
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
        return self.fc(hidden)
```
6. Train the model
```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# sort=False avoids the need for a sort_key and keeps the test set in file order
train_iterator, test_iterator = BucketIterator.splits((trn, tst), batch_size=32, sort=False, device=device)
model = SentimentClassifier(len(TEXT.vocab), 100, 256, len(LABEL.vocab))
optimizer = optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
model = model.to(device)
criterion = criterion.to(device)
num_epochs = 10
for epoch in range(num_epochs):
    model.train()  # enable dropout during training
    total_loss = 0.0
    for batch in train_iterator:
        optimizer.zero_grad()
        text = batch.comment
        label = batch.label
        output = model(text)  # shape: [batch_size, num_classes]
        loss = criterion(output, label)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print('Epoch:', epoch + 1, 'Loss:', total_loss / len(train_iterator))
```
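Because `Field` and `BucketIterator` are only available in older torchtext releases, it may be useful to see how the batching could be done with plain PyTorch instead. The sketch below pads variable-length index sequences with `pad_sequence` in a custom collate function; the class `ReviewDataset`, the function `collate`, the pad index 0, and the toy index lists are all assumptions for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

class ReviewDataset(Dataset):
    """Holds reviews already converted to lists of vocabulary indices."""

    def __init__(self, token_ids, labels):
        self.token_ids = token_ids
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return torch.tensor(self.token_ids[idx], dtype=torch.long), self.labels[idx]

def collate(batch, pad_index=0):
    """Pad the sequences in a batch to equal length and stack the labels."""
    texts, labels = zip(*batch)
    # batch_first defaults to False, giving [seq_len, batch_size] as nn.LSTM expects by default
    padded = pad_sequence(list(texts), padding_value=pad_index)
    return padded, torch.tensor(labels, dtype=torch.long)

# Toy data: two reviews of different lengths with labels 1 and 0
dataset = ReviewDataset([[2, 3, 4], [5, 6]], [1, 0])
loader = DataLoader(dataset, batch_size=2, collate_fn=collate)
text, label = next(iter(loader))
```

The padded batch has the same `[seq_len, batch_size]` layout that `batch.comment` provides, so a model like `SentimentClassifier` can consume it without changes.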
7. Predict on the test set
```python
model.eval()  # disable dropout for inference
y_pred = []
with torch.no_grad():
    for batch in test_iterator:
        text = batch.comment
        output = model(text)  # shape: [batch_size, num_classes]
        pred = output.argmax(dim=1)
        y_pred.extend(pred.tolist())
```
8. Evaluate the model
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
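# Collect the true label index of every test example; this assumes test_iterator
# yields examples in dataset order (no shuffling or sorting)
y_test = [LABEL.vocab.stoi[example.label] for example in tst.examples]
# Labels are assumed to be stored as '0'/'1' in the CSV; the index of '1' is scored as the positive class
pos_index = LABEL.vocab.stoi['1']
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred, pos_label=pos_index))
print('Recall:', recall_score(y_test, y_pred, pos_label=pos_index))
print('F1-score:', f1_score(y_test, y_pred, pos_label=pos_index))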
```
9. Generate the word clouds
```python
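# Group the test reviews by predicted class; labels are assumed to be stored as '0'/'1' in the CSV
# and y_pred is assumed to follow the order of tst.examples
pos_index = LABEL.vocab.stoi['1']
pos_words = ' '.join(' '.join(ex.comment) for ex, p in zip(tst.examples, y_pred) if p == pos_index)
neg_words = ' '.join(' '.join(ex.comment) for ex, p in zip(tst.examples, y_pred) if p != pos_index)
# WordCloud needs a font with CJK glyphs to render Chinese; 'simhei.ttf' is an assumed path
pos_wordcloud = WordCloud(font_path='simhei.ttf', background_color='white', width=800, height=600).generate(pos_words)
neg_wordcloud = WordCloud(font_path='simhei.ttf', background_color='white', width=800, height=600).generate(neg_words)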
plt.imshow(pos_wordcloud)
plt.axis('off')
plt.show()
plt.imshow(neg_wordcloud)
plt.axis('off')
plt.show()
```
That covers the whole process of classifying the food-delivery reviews and plotting the word clouds. I hope it helps!