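Given a Chinese text dataset with sentiment labels in a train.csv file, covering 4 emotions (joy, anger, disgust, and low mood), write Python code for sentiment analysis of Weibo comments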
Time: 2023-12-03 16:46:46
First, load the dataset. The pandas library can read the csv file. The code is as follows:
```python
import pandas as pd
df = pd.read_csv('train.csv')
```
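Before going further it can help to confirm that the file has the columns the later steps rely on and that no text is missing. The following is a minimal sketch, assuming the columns are named `content` and `sentiment` as in the rest of this answer; the in-memory CSV is a made-up stand-in for `train.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for train.csv; the second row has no text on purpose.
sample_csv = io.StringIO(
    "content,sentiment\n"
    "今天心情真好,喜悦\n"
    ",愤怒\n"
    "这个产品太差了,厌恶\n"
)
sample_df = pd.read_csv(sample_csv)

# Drop rows whose text is missing; string methods would fail on NaN later.
clean_df = sample_df.dropna(subset=["content"]).reset_index(drop=True)

print(clean_df.shape)
# Class counts show whether the labels are balanced.
print(clean_df["sentiment"].value_counts())
```

On the real file, the same two calls (`dropna` and `value_counts`) give a quick picture of data quality before any model is trained.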
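Next, preprocess the data: remove missing values, strip emoticons, segment the text into words, and filter out stop words. This uses the jieba library and a Chinese stop word list, both of which need to be downloaded or installed beforehand. The code is as follows: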
```python
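import re

import jieba

# Load the stop word list (one word per line); a set makes lookups fast
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = {line.strip() for line in f}

def preprocess(text):
    # Remove spaces
    text = text.replace(' ', '')
    # Remove bracketed emoticons such as [哈哈]
    text = re.sub(r'\[.*?\]', '', text)
    # Segment into words
    words = jieba.lcut(text)
    # Filter out stop words
    words = [word for word in words if word not in stopwords]
    return ' '.join(words)

# Drop rows with missing text so preprocess only receives strings
df = df.dropna(subset=['content']).reset_index(drop=True)
df['content'] = df['content'].apply(preprocess)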
```
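The emoticon step depends on the pattern `\[.*?\]` being non-greedy, so that each bracketed name is removed separately and the text between two emoticons survives. A small sketch of that step in isolation, using only the standard library; `strip_emoticons` is a hypothetical helper name, and unlike the code above it removes all whitespace rather than only spaces.

```python
import re

# Non-greedy match of one bracketed emoticon name such as [哈哈].
EMOTICON_RE = re.compile(r"\[.*?\]")

def strip_emoticons(text: str) -> str:
    """Remove bracketed emoticon names, then remove all whitespace."""
    without_emoticons = EMOTICON_RE.sub("", text)
    # str.split() with no arguments splits on any run of whitespace.
    return "".join(without_emoticons.split())

print(strip_emoticons("今天 [哈哈] 真开心 [太阳]"))
```

A greedy pattern (`\[.*\]`) would instead delete everything from the first `[` to the last `]`, including the words in between.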
Next, use TfidfVectorizer from the sklearn library to extract features, producing a TF-IDF vector for each text. The code is as follows:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
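# Override token_pattern: the default drops single-character tokens, which are common in Chinese
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')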
X = vectorizer.fit_transform(df['content'])
y = df['sentiment']
```
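One detail of `TfidfVectorizer` matters for Chinese text: its default `token_pattern` only keeps tokens of two or more characters, so single-character words are dropped unless the pattern is overridden. The sketch below compares the two settings on two made-up, already segmented sentences.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences, already segmented and joined with spaces.
docs = ["今天 心情 真 好", "这个 产品 太 差 了"]

# Default pattern: tokens need at least two word characters.
default_vec = TfidfVectorizer()
default_vec.fit(docs)

# Relaxed pattern: single-character tokens are kept as well.
single_char_vec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
single_char_vec.fit(docs)

print(sorted(default_vec.vocabulary_))
print(sorted(single_char_vec.vocabulary_))
```

Words such as 好 and 差 carry much of the sentiment in short comments, so losing them can noticeably hurt a bag-of-words classifier.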
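Next, split the dataset into a training set and a test set, then train and predict with the naive Bayes classifier from the sklearn library. The code is as follows: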
```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model
clf = MultinomialNB()
clf.fit(X_train, y_train)
# Predict on the test set
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
```
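With four classes, a single accuracy number can hide a class the model rarely gets right. A per-class report and a confusion matrix give a fuller picture. The sketch below uses hand-written labels purely for illustration; the label strings are assumptions, since the actual values in the `sentiment` column are not shown here.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true and predicted labels, standing in for y_test and y_pred.
y_true = ["喜悦", "愤怒", "厌恶", "低落", "喜悦", "愤怒"]
y_hat = ["喜悦", "愤怒", "愤怒", "低落", "喜悦", "厌恶"]
labels = ["喜悦", "愤怒", "厌恶", "低落"]

# Precision, recall and F1 for each class.
print(classification_report(y_true, y_hat, labels=labels, zero_division=0))

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)
```

With the real data, passing `y_test` and `y_pred` to the same two functions shows which emotions are confused with each other.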
Finally, the trained model can be used to analyze the sentiment of new text. The code is as follows:
```python
def predict(text):
    # Apply the same preprocessing and vectorizer used for training
    text = preprocess(text)
    X_new = vectorizer.transform([text])
    y_new = clf.predict(X_new)
    return y_new[0]
```
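When a hard label is not enough, `MultinomialNB` can also report a probability per class through `predict_proba`. The sketch below is one possible arrangement, not the method used above: it wraps the vectorizer and classifier in a sklearn `Pipeline` and trains on four made-up, already segmented sentences, so the printed numbers only illustrate the call.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set, already segmented and joined with spaces.
texts = ["今天 心情 真 好", "开心 极 了", "产品 太 差", "气 死 我 了"]
labels = ["喜悦", "喜悦", "厌恶", "愤怒"]

# The pipeline keeps the vectorizer and the classifier together.
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"(?u)\b\w+\b"),
    MultinomialNB(),
)
model.fit(texts, labels)

# One probability per class, in the order given by model.classes_.
probs = model.predict_proba(["心情 真 好"])[0]
for label, p in zip(model.classes_, probs):
    print(label, round(float(p), 3))
```

A pipeline also makes it harder to forget the vectorizer step at prediction time, because `model.predict` runs both stages.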
The complete code is as follows:
```python
import re

import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Load the dataset and drop rows with missing text
df = pd.read_csv('train.csv')
df = df.dropna(subset=['content']).reset_index(drop=True)

# Load the stop word list (one word per line); a set makes lookups fast
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = {line.strip() for line in f}

# Preprocessing
def preprocess(text):
    # Remove spaces
    text = text.replace(' ', '')
    # Remove bracketed emoticons such as [哈哈]
    text = re.sub(r'\[.*?\]', '', text)
    # Segment into words
    words = jieba.lcut(text)
    # Filter out stop words
    words = [word for word in words if word not in stopwords]
    return ' '.join(words)

df['content'] = df['content'].apply(preprocess)

# Feature extraction; the default token_pattern would drop single-character words
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
X = vectorizer.fit_transform(df['content'])
y = df['sentiment']

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

# Predict new text
def predict(text):
    text = preprocess(text)
    X_new = vectorizer.transform([text])
    y_new = clf.predict(X_new)
    return y_new[0]
```
Use the predict function to analyze the sentiment of new text:
```python
print(predict('今天心情真好!'))
print(predict('这个产品太差了,不要买!'))
```