Binary classification of short Chinese texts in Python: the texts are citizen complaints, and the output should be the complaint category
Posted: 2024-04-30 17:22:43
1. Data preprocessing
First, preprocess the data: clean the text, segment it into words, and remove stop words.
```python
import re

import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load stop words into a set for fast membership tests
with open("stopwords.txt", "r", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f if line.strip()}

# Load the data; the code below uses a "content" column and a "label" column
data = pd.read_csv("complaints.csv")
data.dropna(subset=["content", "label"], inplace=True)

# Clean: keep only Chinese characters, ASCII letters, and digits
def clean_text(text):
    return re.sub("[^\u4e00-\u9fa5a-zA-Z0-9]", "", str(text))

data["content"] = data["content"].apply(clean_text)

# Segment with jieba, drop stop words, and join the tokens with spaces
def segment(text):
    words = jieba.cut(text)
    words = [word for word in words if word not in stopwords]
    return " ".join(words)

data["content"] = data["content"].apply(segment)

# Split into training and test sets
train_data, test_data, train_labels, test_labels = train_test_split(
    data["content"], data["label"], test_size=0.2, random_state=42
)

# Feature extraction (note: TfidfVectorizer's default token_pattern
# ignores single-character tokens)
vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(train_data)
test_features = vectorizer.transform(test_data)
```
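To make the cleaning and stop-word steps concrete, here is a minimal standard-library sketch of what they do to one sentence. The sample sentence, the hand-segmented token list, and the two stop words are all made up for illustration; in the real pipeline the tokens come from `jieba.cut` and the stop words from `stopwords.txt`.

```python
import re

# Same character whitelist as clean_text above:
# CJK ideographs, ASCII letters, and digits.
NON_TEXT = re.compile("[^\u4e00-\u9fa5a-zA-Z0-9]")


def clean_text(text):
    """Strip every character outside the whitelist (punctuation, spaces, emoji)."""
    return NON_TEXT.sub("", text)


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not stop words."""
    return [tok for tok in tokens if tok not in stopwords]


# Hypothetical complaint: "The elevator in building 3 is broken again!!"
raw = "小区3号楼,电梯又坏了!!"
cleaned = clean_text(raw)
print(cleaned)  # 小区3号楼电梯又坏了

# Hand-written segmentation standing in for jieba's output (an assumption).
tokens = ["小区", "3", "号楼", "电梯", "又", "坏", "了"]
demo_stopwords = {"又", "了"}  # illustrative stop words only
kept = remove_stopwords(tokens, demo_stopwords)
print(" ".join(kept))  # 小区 3 号楼 电梯 坏
```

Because the whitelist also removes spaces, English words in a complaint are glued together after cleaning; if that matters for your data, the pattern would need to keep whitespace.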
2. Model training and evaluation
Train a logistic regression model and evaluate it on both the training set and the test set.
```python
# Train the model
clf = LogisticRegression()
clf.fit(train_features, train_labels)
# Evaluate the model
train_pred = clf.predict(train_features)
train_acc = accuracy_score(train_labels, train_pred)
print("Training accuracy:", train_acc)
test_pred = clf.predict(test_features)
test_acc = accuracy_score(test_labels, test_pred)
print("Test accuracy:", test_acc)
```
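Accuracy alone can be misleading when one class is much rarer than the other, which is common with complaint data. The sketch below computes precision, recall, and F1 by hand so the definitions are visible. It assumes the labels are encoded as 0 and 1 (the original does not say how `label` is encoded), and the label lists are invented for illustration. In practice `sklearn.metrics.classification_report` reports the same quantities.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical imbalanced test set: 8 negatives, 2 positives.
y_true = [0] * 8 + [1] * 2

# A model that always predicts the majority class still looks accurate.
always_negative = binary_metrics(y_true, [0] * 10)
print(always_negative)

# A model that finds one of the two positives, with one false alarm.
mixed = binary_metrics(y_true, [0] * 7 + [1] + [1, 0])
print(mixed)
```

The first model reaches 0.8 accuracy while its recall on the positive class is 0, which is the case accuracy hides.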
3. Using the model
Use the trained model to classify a new complaint.
```python
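def predict(text):
    # Apply the same cleaning and segmentation that were used for training
    text = clean_text(text)
    text = segment(text)
    feature = vectorizer.transform([text])
    label = clf.predict(feature)[0]
    return label

# Sample complaint: "People keep dumping trash around the bin outside my door;
# I hope someone can come and clean it up"
text = "我家门口的垃圾桶经常被人乱扔,希望有人来清理一下"
label = predict(text)
print("Predicted complaint category:", label)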
```
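One edge case worth guarding against: if the input contains only punctuation or symbols, cleaning leaves an empty string, the TF-IDF vector is all zeros, and the logistic regression prediction is decided by the intercept alone. The wrapper below is a sketch of one way to handle that. `predict_or_none`, `demo_clean`, and `demo_predict` are hypothetical names introduced here; the last two are stand-ins for `clean_text` and `predict` so the example runs without the trained model.

```python
import re


def predict_or_none(text, clean_fn, predict_fn):
    """Return predict_fn(text), or None when cleaning leaves nothing to classify."""
    if not isinstance(text, str) or not clean_fn(text):
        return None
    return predict_fn(text)


def demo_clean(text):
    # Same whitelist as clean_text above.
    return re.sub("[^\u4e00-\u9fa5a-zA-Z0-9]", "", text)


def demo_predict(text):
    # Hypothetical stand-in for predict(); always returns label 1.
    return 1


print(predict_or_none("!!!???", demo_clean, demo_predict))    # None
print(predict_or_none("路灯坏了", demo_clean, demo_predict))  # 1
```

With the real pipeline the call would be `predict_or_none(text, clean_text, predict)`. This guard does not catch text made up entirely of stop words or words unseen during training, which also produce an all-zero vector.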