Implementing Sentiment Recognition in Python
Sentiment recognition means using natural language processing (NLP) techniques to analyze and identify the emotion expressed in a piece of text. It is used in scenarios such as social-media sentiment analysis and user-review analysis. Python is one of the most widely used languages for NLP; the steps below show how to implement sentiment recognition with it.
1. Install the required packages
First, install the necessary Python packages, including nltk, scikit-learn, and numpy. They can be installed with pip:
```
pip install nltk
pip install scikit-learn
pip install numpy
```
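In addition, NLTK's tokenizer, stopword list, and lemmatizer depend on data files that are downloaded separately from the library itself. A one-time setup step (these resource names are standard in current NLTK releases):
```python
import nltk

# One-time download of the NLTK data used in this article;
# safe to re-run, already-downloaded resources are skipped.
nltk.download('punkt')      # tokenizer models for word_tokenize
nltk.download('stopwords')  # English stopword list
nltk.download('wordnet')    # lexicon used by WordNetLemmatizer
```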
2. Prepare a dataset
Sentiment recognition requires a dataset labeled with sentiment. Public datasets such as the IMDB movie-review dataset or the SemEval-2017 dataset work well. This article uses the IMDB dataset, which can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/.
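For convenience, a minimal download sketch. The direct file URL is an assumption based on the page above combined with the archive name `aclImdb_v1.tar.gz` that the full script below expects; adjust if the page layout changes:
```python
import os
import urllib.request

# Assumed direct link to the archive hosted on the page above.
URL = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'

# Download once; extraction is handled by the full script below.
if not os.path.exists('aclImdb_v1.tar.gz'):
    urllib.request.urlretrieve(URL, 'aclImdb_v1.tar.gz')
```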
3. Preprocess the data
The IMDB dataset already ships with separate training and test directories; beyond that, the text needs cleaning and normalization. NLTK can handle tokenization, stopword removal, and similar preprocessing steps:
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
def preprocess(text):
    # Tokenize and lowercase
    words = word_tokenize(text.lower())
    # Drop stopwords and non-alphabetic tokens (punctuation, numbers)
    words = [w for w in words if w.isalpha() and w not in stop_words]
    # Lemmatize each token
    words = [lemmatizer.lemmatize(w) for w in words]
    return words
```
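A quick check on a sample sentence shows the effect of each step (the expected output is approximate, since it depends on the NLTK stopword list and lemmatizer version):
```python
print(preprocess("These movies were absolutely wonderful!"))
# Expected output (approximately): ['movie', 'absolutely', 'wonderful']
```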
4. Feature extraction
Convert each text into a feature vector, using either a bag-of-words or a TF-IDF representation. Here we use TF-IDF via the TfidfVectorizer class from sklearn:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
# token_pattern is unused when a custom tokenizer is supplied;
# setting it to None silences the corresponding sklearn warning
vectorizer = TfidfVectorizer(tokenizer=preprocess, token_pattern=None)
X_train = vectorizer.fit_transform(train_data)
X_test = vectorizer.transform(test_data)
```
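The result is a sparse document-term matrix. Checking its shape and the learned vocabulary is a quick way to confirm the preprocessing behaved as expected:
```python
# X_train is a scipy sparse matrix: one row per document,
# one column per term in the learned vocabulary.
print(X_train.shape)                      # (n_documents, n_terms)
print(len(vectorizer.vocabulary_))        # vocabulary size
print(list(vectorizer.vocabulary_)[:10])  # a few sample terms
```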
5. Train the model and predict
Train a classifier on the feature vectors and use it to predict labels for the test set. Any of sklearn's classifiers will work, such as naive Bayes or a support vector machine; here we use the multinomial naive Bayes classifier:
```python
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```
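Since support vector machines were mentioned as an alternative, here is a sketch using sklearn's LinearSVC instead; it exposes the same fit/predict interface, so nothing else in the pipeline changes:
```python
from sklearn.svm import LinearSVC

# Drop-in alternative to MultinomialNB; linear SVMs often do well
# on high-dimensional TF-IDF features, at some extra training cost.
svm_clf = LinearSVC()
svm_clf.fit(X_train, y_train)
y_pred_svm = svm_clf.predict(X_test)
```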
6. Evaluate the model
Evaluate the model with metrics such as accuracy, precision, recall, and F1 score, using sklearn's evaluation functions:
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
acc = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')
print('Accuracy:', acc)
print('Precision:', precision)
print('Recall:', recall)
print('F1 score:', f1)
```
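sklearn's classification_report prints per-class precision, recall, and F1 in a single call, which is a convenient alternative to computing each metric separately:
```python
from sklearn.metrics import classification_report, confusion_matrix

# Per-class metrics plus macro and weighted averages in one table.
print(classification_report(y_test, y_pred))
# Rows are true labels, columns are predicted labels.
print(confusion_matrix(y_test, y_pred))
```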
That completes the workflow for sentiment recognition in Python. The full script is below:
```python
import os
import tarfile
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Download the NLTK resources used below (no-op if already present)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Extract the dataset archive
def extract_data(file_path):
    with tarfile.open(file_path, 'r:gz') as tar:
        tar.extractall()

# Load reviews and labels from the pos/ and neg/ subdirectories
def load_data(data_dir):
    data = []
    labels = []
    for label in ['pos', 'neg']:
        dir_name = os.path.join(data_dir, label)
        for fname in os.listdir(dir_name):
            if fname.endswith('.txt'):
                fpath = os.path.join(dir_name, fname)
                with open(fpath, 'r', encoding='utf-8') as f:
                    data.append(f.read())
                labels.append(label)
    return data, labels

# Text preprocessing
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
def preprocess(text):
    # Tokenize and lowercase
    words = word_tokenize(text.lower())
    # Drop stopwords and non-alphabetic tokens (punctuation, numbers)
    words = [w for w in words if w.isalpha() and w not in stop_words]
    # Lemmatize each token
    words = [lemmatizer.lemmatize(w) for w in words]
    return words

# Load the dataset, extracting the archive first if necessary
data_dir = 'aclImdb'
if not os.path.isdir(data_dir):
    extract_data('aclImdb_v1.tar.gz')
train_data, train_labels = load_data(os.path.join(data_dir, 'train'))
test_data, test_labels = load_data(os.path.join(data_dir, 'test'))

# Feature extraction (token_pattern=None avoids a warning when a
# custom tokenizer is supplied)
vectorizer = TfidfVectorizer(tokenizer=preprocess, token_pattern=None)
X_train = vectorizer.fit_transform(train_data)
X_test = vectorizer.transform(test_data)

# Train the model and predict
clf = MultinomialNB()
clf.fit(X_train, train_labels)
y_pred = clf.predict(X_test)

# Evaluate the model
acc = accuracy_score(test_labels, y_pred)
precision = precision_score(test_labels, y_pred, average='macro')
recall = recall_score(test_labels, y_pred, average='macro')
f1 = f1_score(test_labels, y_pred, average='macro')
print('Accuracy:', acc)
print('Precision:', precision)
print('Recall:', recall)
print('F1 score:', f1)
```
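Once trained, the same vectorizer and classifier can label unseen text. A minimal usage sketch (the review text here is purely illustrative):
```python
# Classify a new, unlabeled review with the fitted pipeline.
new_reviews = ["An absolute masterpiece, I loved every minute of it."]
X_new = vectorizer.transform(new_reviews)
print(clf.predict(X_new))  # e.g. ['pos']
```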