Building text classification models from different text features, evaluating them, and analyzing pandemic sentiment
Posted: 2023-06-19 20:10:15
Below is a Python example that builds text classification models from several kinds of text features, evaluates them, and analyzes pandemic sentiment.
First, import the required libraries and load the dataset. The example uses an English dataset of COVID-19-related tweets.
```python
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
data = pd.read_csv('corona_tweets.csv', encoding='utf-8')
```
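Before modeling, it helps to check the label distribution, since class imbalance changes how accuracy should be read. A minimal sketch, using a tiny inline DataFrame as a stand-in for `corona_tweets.csv` (the example rows are invented; the column names match those used below):

```python
import pandas as pd

# Stand-in for corona_tweets.csv: a few invented rows with the
# two columns the rest of the article relies on.
data = pd.DataFrame({
    'OriginalTweet': ['Stay safe everyone!', 'This lockdown is awful', 'Vaccines work'],
    'Sentiment': ['Positive', 'Negative', 'Positive'],
})

# Label distribution: imbalanced classes make raw accuracy misleading.
counts = data['Sentiment'].value_counts()
print(counts.to_dict())
```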
Next comes preprocessing. For text data, we tokenize, remove stop words, and so on.
```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
stop_words = stopwords.words('english')
def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    tokens = [token for token in tokens if token.isalpha()]  # keep alphabetic tokens only
    tokens = [token for token in tokens if token not in stop_words]  # drop stop words
    return ' '.join(tokens)
data['processed_text'] = data['OriginalTweet'].apply(preprocess_text)
```
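The same preprocessing idea can be sketched without NLTK's downloads, using a regex tokenizer and a tiny stand-in stop-word list (both are simplifications of the NLTK-based version above, for illustration only):

```python
import re

# A small stand-in stop-word list; the article uses NLTK's full English list.
stop_words_demo = {'the', 'is', 'a', 'of', 'and', 'to', 'in'}

def preprocess_demo(text):
    # Lowercase, keep runs of letters only, drop stop words --
    # mirrors the NLTK word_tokenize + isalpha() pipeline above.
    tokens = re.findall(r'[a-z]+', text.lower())
    return ' '.join(t for t in tokens if t not in stop_words_demo)

print(preprocess_demo('The spread of COVID-19 is a concern!'))
```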
Now we can try building classification models from different text features. Two feature sets are constructed here: raw term frequencies (bag of words) and TF-IDF. The split and model further below use the TF-IDF features; to compare, substitute `count_features` into the same steps.
```python
# Bag-of-words (term frequency) features
count_vectorizer = CountVectorizer()
count_features = count_vectorizer.fit_transform(data['processed_text'])
# TF-IDF features
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(data['processed_text'])
```
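To see what the two vectorizers actually produce, here is a small sketch on a toy corpus (the three sentences are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['covid cases rising', 'covid vaccine rollout', 'vaccine cases']

cv = CountVectorizer()
count_mat = cv.fit_transform(docs)
tv = TfidfVectorizer()
tfidf_mat = tv.fit_transform(docs)

# Both produce a (n_documents, vocabulary_size) sparse matrix over
# the same vocabulary; counts are integers, TF-IDF entries are
# floats down-weighted for terms common across documents.
print(sorted(cv.vocabulary_))
print(count_mat.shape, tfidf_mat.shape)
```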
Next, we split the data into training and test sets and fit a multinomial naive Bayes classifier.
```python
# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(tfidf_features, data['Sentiment'], test_size=0.2, random_state=42)
# Fit a multinomial naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
# Predict on the test set and evaluate the model
y_pred = nb_classifier.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
print('Classification Report:\n', classification_report(y_test, y_pred))
```
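Since both feature sets were built, the comparison can be sketched directly by fitting the same classifier on each. The toy corpus below is invented for illustration, and the score reported is training accuracy on that toy data, not a held-out estimate:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Tiny invented corpus standing in for the tweet data.
texts = ['great news recovery', 'awful deadly outbreak', 'good vaccine progress',
         'terrible rising deaths', 'hopeful positive trend', 'bad grim situation']
labels = ['pos', 'neg', 'pos', 'neg', 'pos', 'neg']

results = {}
for name, vec in [('counts', CountVectorizer()), ('tf-idf', TfidfVectorizer())]:
    X = vec.fit_transform(texts)           # each feature set gets its own vectorizer
    clf = MultinomialNB().fit(X, labels)
    results[name] = accuracy_score(labels, clf.predict(X))
print(results)
```

On real data, compare the two on the held-out test split rather than the training set.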
Finally, we can visualize the results, for example by plotting the confusion matrix as a heatmap.
```python
# Plot the confusion matrix as a heatmap
import matplotlib.pyplot as plt
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='Blues', fmt='d')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()
```
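To analyze the sentiment of new, unseen text, pass it through the already-fitted vectorizer's `transform()` (never `fit_transform()`), so its columns line up with the training features. A minimal sketch on an invented two-tweet corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training data standing in for the preprocessed tweets.
train_texts = ['love this support helpful', 'hate this useless awful']
train_labels = ['Positive', 'Negative']

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_texts)
clf = MultinomialNB().fit(X_train, train_labels)

# transform(), not fit_transform(): reuse the fitted vocabulary so the
# new text maps onto the same feature columns; unseen words are dropped.
new = vec.transform(['this support is helpful'])
pred = clf.predict(new)[0]
print(pred)
```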
The complete code:
```python
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
data = pd.read_csv('corona_tweets.csv', encoding='utf-8')
stop_words = stopwords.words('english')
def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    tokens = [token for token in tokens if token.isalpha()]
    tokens = [token for token in tokens if token not in stop_words]
    return ' '.join(tokens)
data['processed_text'] = data['OriginalTweet'].apply(preprocess_text)
# Bag-of-words (term frequency) features
count_vectorizer = CountVectorizer()
count_features = count_vectorizer.fit_transform(data['processed_text'])
# TF-IDF features
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(data['processed_text'])
# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(tfidf_features, data['Sentiment'], test_size=0.2, random_state=42)
# Fit a multinomial naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
# Predict on the test set and evaluate the model
y_pred = nb_classifier.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
print('Classification Report:\n', classification_report(y_test, y_pred))
# Plot the confusion matrix as a heatmap
import matplotlib.pyplot as plt
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='Blues', fmt='d')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()
```