Using the Weibo topic "你会原谅伤害过你的父母吗" ("Would you forgive parents who hurt you?") as an example, perform text classification with a hybrid of KNN and a decision tree (including hyperparameter tuning), and explain the process in detail, covering (but not limited to) data collection (topic weibos only) and data cleaning. Also explain the advantages of the hybrid approach over using either model alone, and give the full Python code.
Date: 2024-02-20 19:02:25
First, collect weibo data related to the topic and clean it. Here we take scraping weibos that contain the keyword "你会原谅伤害过你的父母吗" as an example.
```python
import requests
from bs4 import BeautifulSoup

# Search-results page for the topic hashtag
url = "https://s.weibo.com/weibo?q=%23%E4%BD%A0%E4%BC%9A%E5%8E%9F%E8%B0%85%E4%BC%A4%E5%AE%B3%E8%BF%87%E4%BD%A0%E7%9A%84%E7%88%B6%E6%AF%8D%E5%90%97%23&Refer=top"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299"
}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
weibo_list = soup.select(".card-wrap")

# Save the scraped posts to weibo.txt so the cleaning step below can read them back
with open("weibo.txt", "w", encoding="utf-8") as f:
    for weibo in weibo_list:
        txt = weibo.select_one(".txt")
        if txt is not None:  # skip cards without a text node
            f.write(txt.text.strip() + "\n")
```
Next, clean the data by stripping useless symbols and punctuation. Since this is text classification on Chinese text, word segmentation is also required.
```python
import jieba

def clean_text(text):
    # Strip topic markers and common Weibo boilerplate
    text = text.replace("#", "").replace("转发微博", "").replace("收起全文d", "")
    # Segment into words and rejoin with spaces for the vectorizer
    words = jieba.cut(text)
    return " ".join(words)

# Read the scraped weibos, then clean and segment each one
with open("weibo.txt", "r", encoding="utf-8") as f:
    data = f.readlines()

cleaned_data = [clean_text(text) for text in data]
```
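The train/test split further on also needs a `labels` list aligned with `cleaned_data`, which the walkthrough does not show how to obtain; in practice the weibos would be annotated by hand. A minimal loading sketch, assuming a hypothetical tab-separated file `weibo_labeled.txt` where each line is `label<TAB>text` (both the file name and the format are assumptions for illustration):

```python
def load_labeled(path):
    """Read one '<label>\t<text>' record per line into parallel lists."""
    texts, labels = [], []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            label, text = line.split("\t", 1)
            texts.append(text)
            labels.append(int(label))
    return texts, labels
```

Here `1` might mean "would forgive" and `0` "would not"; any consistent integer coding works with scikit-learn.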
Next, we use KNN and a decision tree together for text classification. First, the text must be converted into numeric features; here we use TF-IDF. We also split the data into training and test sets and scale the features. Note that `StandardScaler` needs dense input by default, so the sparse TF-IDF matrices are converted with `toarray()`; for large corpora, `StandardScaler(with_mean=False)` can be applied to the sparse matrix directly instead.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Convert the text into TF-IDF features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned_data)

# labels must be a sequence of class labels aligned with cleaned_data
# (e.g. produced by manually annotating each weibo)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# Scale features; toarray() densifies the sparse TF-IDF matrices
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train.toarray())
X_test = scaler.transform(X_test.toarray())
```
Next, we classify the text with KNN and a decision tree, tuning each model's hyperparameters with grid search.
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Grid search over KNN hyperparameters
knn_param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"],
    "algorithm": ["auto", "ball_tree", "kd_tree", "brute"]
}
knn_search = GridSearchCV(KNeighborsClassifier(), knn_param_grid, cv=5)
knn_search.fit(X_train, y_train)
print("Best KNN parameters:", knn_search.best_params_)

# Grid search over decision-tree hyperparameters
dt_param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, 9]
}
dt_search = GridSearchCV(DecisionTreeClassifier(), dt_param_grid, cv=5)
dt_search.fit(X_train, y_train)
print("Best decision-tree parameters:", dt_search.best_params_)

# Use the tuned models (GridSearchCV refits the best estimator
# on the full training set), rather than hard-coding parameters
knn = knn_search.best_estimator_
dt = dt_search.best_estimator_
knn_pred = knn.predict(X_test)
dt_pred = dt.predict(X_test)

# Hybrid prediction: when the two models agree, take the shared label;
# when they disagree, defer to the more confident model, i.e. the one
# whose top predicted class probability is higher
knn_proba = knn.predict_proba(X_test)
dt_proba = dt.predict_proba(X_test)
ensemble_pred = []
for i in range(len(knn_pred)):
    if knn_pred[i] == dt_pred[i]:
        ensemble_pred.append(knn_pred[i])
    elif knn_proba[i].max() >= dt_proba[i].max():
        ensemble_pred.append(knn_pred[i])
    else:
        ensemble_pred.append(dt_pred[i])

# Report accuracy for each model and for the hybrid
print("KNN accuracy:", accuracy_score(y_test, knn_pred))
print("Decision-tree accuracy:", accuracy_score(y_test, dt_pred))
print("Hybrid accuracy:", accuracy_score(y_test, ensemble_pred))
```
Compared with using either model alone, the hybrid exploits the complementary strengths of the two algorithms: KNN captures local neighborhood structure in the TF-IDF space, while the decision tree learns global, interpretable split rules. When the two models disagree, deferring to the more confident one can correct some of each model's individual mistakes, which tends to improve accuracy and generalization.
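The hand-rolled combination above can also be expressed with scikit-learn's built-in soft-voting ensemble, which averages the two models' predicted probabilities instead of picking one winner per sample. A minimal sketch on synthetic data (the dataset here is generated purely for illustration; real input would be the scaled TF-IDF features from the pipeline above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the TF-IDF features
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Soft voting averages predict_proba across the two models,
# so the more confident model dominates on disagreements
ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("dt", DecisionTreeClassifier(max_depth=7, random_state=42)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
print("ensemble accuracy:", ensemble.score(X_test, y_test))
```

`VotingClassifier` can also be placed inside `GridSearchCV` to tune both sub-models' hyperparameters jointly (e.g. `knn__n_neighbors`, `dt__max_depth`).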