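How can KNN, decision trees, SVM, naive Bayes, and similar algorithms be used to classify scraped scenic-spot reviews and visualize the results? Please provide the code.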
Date: 2023-12-25 10:02:57
First, import the required libraries:
```python
import pandas as pd
import numpy as np
import jieba
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
```
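Next, read the CSV file of scenic-spot reviews. The code below expects a `comment` column holding the review text and a `label` column holding the class: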
```python
df = pd.read_csv('comments.csv')
```
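Scraped data often contains empty or missing reviews, which would make the segmentation step fail. A minimal cleaning sketch, assuming the same `comment` and `label` column names; the small DataFrame below is a made-up stand-in for `comments.csv`:

```python
import pandas as pd

# Toy stand-in for comments.csv (made-up rows, for illustration only)
df = pd.DataFrame({
    "comment": ["景色很好", None, "  ", "排队太久"],
    "label": [1, 1, 0, 0],
})

# Drop rows with a missing review or label
df = df.dropna(subset=["comment", "label"])

# Normalize to stripped strings and drop reviews that are empty after stripping
df["comment"] = df["comment"].astype(str).str.strip()
df = df[df["comment"] != ""].reset_index(drop=True)

# Check the class balance before training
print(df["label"].value_counts())
```

Checking `value_counts()` early is worthwhile because a heavily imbalanced label column affects both the split and how the metrics should be read.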
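Next, we need to segment the reviews into words. We can use the `cut` function from the `jieba` library and join the tokens with spaces so that the vectorizer can split them again: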
```python
def cut_comment(comment):
    # Segment the review with jieba and join the tokens with spaces
    return ' '.join(jieba.cut(comment))

df['cut_comment'] = df['comment'].apply(cut_comment)
```
Then we can use `CountVectorizer` or `TfidfVectorizer` to turn the reviews into feature vectors. Here we use `TfidfVectorizer` as an example:
```python
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['cut_comment'])
y = df['label'].values
```
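One detail worth knowing for Chinese text: the default `token_pattern` of `TfidfVectorizer` only keeps tokens of two or more characters, so single-character words such as `好` or `差` are silently dropped. The sketch below compares the default with a pattern that keeps them; the two toy documents are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two already-segmented toy reviews (tokens separated by spaces)
docs = ["景色 很 好", "服务 很 差"]

# Default pattern r"(?u)\b\w\w+\b" requires at least two word characters
default_vec = TfidfVectorizer()
default_vec.fit(docs)

# Relaxed pattern keeps single-character tokens as well
keep_single = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
keep_single.fit(docs)

print(sorted(default_vec.vocabulary_))
print(sorted(keep_single.vocabulary_))
```

Whether to keep single-character tokens is a judgment call: they carry sentiment in short reviews, but they also add noise, so it may be worth comparing both settings on the real data.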
Next, we split the dataset into a training set (80%) and a test set (20%):
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
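If the classes are imbalanced, a plain random split can leave the test set with very few samples of the rarer class. A minimal sketch of a stratified split using the `stratify` parameter of `train_test_split`; the texts and labels are placeholders:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder data: 15 positive and 5 negative reviews
texts = [f"review {i}" for i in range(20)]
labels = [1] * 15 + [0] * 5

# stratify=labels keeps the 3:1 class ratio in both splits
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

train_counts = Counter(train_labels)
test_counts = Counter(test_labels)
print(train_counts, test_counts)
```

Splitting the raw texts like this, before vectorizing, also makes it possible to fit the vectorizer on the training portion only.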
Then we train KNN, decision tree, SVM, and naive Bayes classifiers on the training set and predict the labels of the test set:
```python
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
svm = SVC()
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)
nb = MultinomialNB()
nb.fit(X_train, y_train)
nb_pred = nb.predict(X_test)
```
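The four models can also be trained in a loop, with the vectorizer wrapped in a pipeline so that it is fitted on the training texts only. This is one possible arrangement rather than the only way to do it; the toy reviews, `n_neighbors=3`, and the linear SVM kernel are assumptions chosen to keep the sketch small:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy, already-segmented reviews (made up); 1 = positive, 0 = negative
train_texts = [
    "景色 很 美 值得 去", "服务 很 好 风景 美", "空气 好 景色 美",
    "值得 去 体验 很 好", "风景 美 服务 好", "很 美 很 值得",
    "排队 太 久 体验 差", "门票 太 贵 服务 差", "人 太 多 很 差",
    "太 贵 不 值得", "服务 差 排队 久", "体验 差 太 贵",
]
train_labels = [1] * 6 + [0] * 6
test_texts = ["景色 美 服务 好", "门票 贵 体验 差"]

models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(kernel="linear"),
    "Naive Bayes": MultinomialNB(),
}

predictions = {}
for name, model in models.items():
    # Each pipeline gets its own vectorizer, fitted on the training texts only;
    # the token pattern keeps single-character words
    pipeline = make_pipeline(TfidfVectorizer(token_pattern=r"(?u)\b\w+\b"), model)
    pipeline.fit(train_texts, train_labels)
    predictions[name] = pipeline.predict(test_texts)
    print(name, predictions[name])
```

Keeping the predictions in a dictionary makes it easy to report and plot all models with the same code afterwards.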
Finally, we can print the precision, recall, F1-score, and related metrics for each classifier:
```python
print('KNN classification report:\n', classification_report(y_test, knn_pred))
print('Decision tree classification report:\n', classification_report(y_test, dt_pred))
print('SVM classification report:\n', classification_report(y_test, svm_pred))
print('Naive Bayes classification report:\n', classification_report(y_test, nb_pred))
```
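To make the numbers in the report easier to read, here is a small worked example of precision, recall, and F1 for the positive class. The labels are made up: there are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 3/4.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy ground truth and predictions (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 1]
y_hat = [1, 0, 1, 0, 1, 1]

# precision = TP / (TP + FP), recall = TP / (TP + FN)
precision = precision_score(y_true, y_hat)
recall = recall_score(y_true, y_hat)
# F1 is the harmonic mean of precision and recall
f1 = f1_score(y_true, y_hat)

print(precision, recall, f1)
```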
We can also visualize the classification results, for example by plotting the confusion matrix for KNN:
```python
from sklearn.metrics import confusion_matrix
import seaborn as sns
knn_cm = confusion_matrix(y_test, knn_pred)
# fmt='d' prints the cell counts as plain integers instead of scientific notation
sns.heatmap(knn_cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
```
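A bar chart of accuracy per model is another simple way to compare the classifiers side by side. The sketch below uses made-up labels and predictions purely so that it runs on its own; in practice the dictionary would hold the real test-set predictions of each model:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

# Toy labels and predictions, made up purely to make the sketch runnable
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "KNN": [1, 0, 0, 1, 0, 1, 1, 0],
    "Decision tree": [1, 1, 0, 1, 0, 0, 0, 0],
    "SVM": [1, 0, 1, 1, 0, 0, 1, 1],
    "Naive Bayes": [1, 0, 1, 1, 0, 0, 0, 0],
}

# Accuracy of each model on the same test labels
accuracies = {name: accuracy_score(y_test, pred) for name, pred in predictions.items()}

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(list(accuracies.keys()), list(accuracies.values()))
ax.set_ylim(0, 1)
ax.set_ylabel("Accuracy")
ax.set_title("Accuracy by classifier (toy data)")
fig.tight_layout()
fig.savefig("model_accuracy.png")  # use plt.show() instead in an interactive session
```

Accuracy alone can be misleading on imbalanced data, so it may be better to plot macro-averaged F1 in the same way when one class dominates.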
The complete code is as follows:
```python
import pandas as pd
import numpy as np
import jieba
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Read the scenic-spot review data
df = pd.read_csv('comments.csv')
# Segment the reviews into space-separated words
def cut_comment(comment):
    return ' '.join(jieba.cut(comment))
df['cut_comment'] = df['comment'].apply(cut_comment)
# Convert the reviews into TF-IDF feature vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['cut_comment'])
y = df['label'].values
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Classify with KNN, decision tree, SVM, and naive Bayes
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
svm = SVC()
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)
nb = MultinomialNB()
nb.fit(X_train, y_train)
nb_pred = nb.predict(X_test)
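# Print the classification reports
print('KNN classification report:\n', classification_report(y_test, knn_pred))
print('Decision tree classification report:\n', classification_report(y_test, dt_pred))
print('SVM classification report:\n', classification_report(y_test, svm_pred))
print('Naive Bayes classification report:\n', classification_report(y_test, nb_pred))
# Plot the confusion matrix for KNN (fmt='d' shows counts as integers)
knn_cm = confusion_matrix(y_test, knn_pred)
sns.heatmap(knn_cm, annot=True, fmt='d', cmap='Blues')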
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
```