Text classification with sklearn logistic regression in Python
Date: 2023-12-28 14:05:15
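The logistic regression model in the sklearn library can be used for text classification. The steps are as follows:
1. Prepare the data
First, prepare the data for training and testing, typically a dataset of texts paired with labels. The following code loads the 20 Newsgroups dataset, stripping headers, footers, and quoted replies from each post:
```python
from sklearn.datasets import fetch_20newsgroups
# Download the 20 Newsgroups dataset (fetched on first use)
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
# Extract the texts and labels of the training and test sets
X_train, y_train = newsgroups_train.data, newsgroups_train.target
X_test, y_test = newsgroups_test.data, newsgroups_test.target
```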
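If you want to use your own texts instead of a built-in dataset, the same "texts plus labels" layout can be produced with `train_test_split`. The following is only a minimal sketch: the six sentences and the label scheme (0 = sports, 1 = tech) are made up for illustration and are not part of the original answer.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: one integer label per text (0 = sports, 1 = tech).
texts = [
    "the team won the match last night",
    "the striker scored two goals",
    "the coach praised the defense",
    "the new phone has a faster processor",
    "the laptop ships with more memory",
    "the update fixes a software bug",
]
labels = [0, 0, 0, 1, 1, 1]

# Hold out two samples for testing; stratify keeps both classes in each split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=2, stratify=labels, random_state=0
)
print(len(X_train), len(X_test))
```

The resulting `X_train`, `y_train`, `X_test`, and `y_test` have the same roles as the variables built from the 20 Newsgroups data.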
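2. Extract features
Convert the text into numeric features that a machine learning algorithm can process. TF-IDF can be used to turn each document into a vector. Note that the vectorizer is fitted on the training set only and then reused, unchanged, on the test set:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
# Build the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Learn the vocabulary from the training set and convert it to TF-IDF vectors
X_train_tfidf = vectorizer.fit_transform(X_train)
# Convert the test set to TF-IDF vectors using the same vocabulary
X_test_tfidf = vectorizer.transform(X_test)
```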
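To make the difference between `fit_transform` and `transform` concrete, here is a small sketch on a made-up three-sentence corpus (the sentences are assumptions for illustration). Words that appear only in the test document are not in the learned vocabulary and are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpus.
train_docs = ["the cat sat on the mat", "the dog chased the cat"]
test_docs = ["the bird chased the dog"]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training docs.
X_train_tfidf = vectorizer.fit_transform(train_docs)
# transform reuses that vocabulary; "bird" was never seen, so it is dropped.
X_test_tfidf = vectorizer.transform(test_docs)

print(sorted(vectorizer.vocabulary_))          # the learned terms
print(X_train_tfidf.shape, X_test_tfidf.shape)  # same number of columns
```

Both matrices are sparse and share the same column count, which is what allows a model trained on one to score the other.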
3. Train the model
Fit a logistic regression model on the training data:
```python
from sklearn.linear_model import LogisticRegression
# Build the logistic regression model
clf = LogisticRegression()
# Train the model
clf.fit(X_train_tfidf, y_train)
```
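As one possible variation, the vectorizer and the classifier can be chained with `make_pipeline` so that raw strings go in and predictions come out. This is a sketch rather than the original answer's method: the toy texts are invented, and `max_iter=1000` is an assumed setting that raises the default iteration limit of 100 in case the solver reports a `ConvergenceWarning` on larger data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (0 = sports, 1 = tech).
texts = [
    "the team won the match last night",
    "the striker scored two goals",
    "the coach praised the defense",
    "the new phone has a faster processor",
    "the laptop ships with more memory",
    "the update fixes a software bug",
]
labels = [0, 0, 0, 1, 1, 1]

# The pipeline runs TF-IDF first and feeds the result to logistic regression.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

# predict_proba returns one row per input text and one column per class.
proba = model.predict_proba(["the goalkeeper saved the match"])
classes = model.named_steps["logisticregression"].classes_
print(classes, proba)
```

A pipeline also removes the risk of accidentally fitting the vectorizer on test data, since `fit` and `predict` handle the transformation internally.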
4. Evaluate the model
Evaluate the model on the test set:
```python
from sklearn.metrics import accuracy_score
# Predict labels for the test set
y_pred = clf.predict(X_test_tfidf)
# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```
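Accuracy is a single number and hides which classes get confused with each other. As a hedged sketch, `confusion_matrix` and `classification_report` from `sklearn.metrics` give a per-class view; the label lists below are made up so the example runs without a trained model.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels for a 3-class problem.
y_test = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

accuracy = accuracy_score(y_test, y_pred)
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

print("Accuracy:", accuracy)
print(cm)
# Per-class precision, recall, and F1 score.
print(classification_report(y_test, y_pred))
```

With the real data, pass the `y_test` and `y_pred` arrays from the steps above instead of the toy lists.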
The complete code is shown below:
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Download the 20 Newsgroups dataset (fetched on first use)
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
# Extract the texts and labels of the training and test sets
X_train, y_train = newsgroups_train.data, newsgroups_train.target
X_test, y_test = newsgroups_test.data, newsgroups_test.target
# Build the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Learn the vocabulary from the training set and convert it to TF-IDF vectors
X_train_tfidf = vectorizer.fit_transform(X_train)
# Convert the test set to TF-IDF vectors using the same vocabulary
X_test_tfidf = vectorizer.transform(X_test)
# Build the logistic regression model
clf = LogisticRegression()
# Train the model
clf.fit(X_train_tfidf, y_train)
# Predict labels for the test set
y_pred = clf.predict(X_test_tfidf)
# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```
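Once a model is trained, classifying a new piece of text follows the same path: transform it with the already fitted vectorizer, then call `predict`. The sketch below uses an invented toy corpus and a hypothetical `target_names` list so that it runs without downloading anything; with the 20 Newsgroups data, `newsgroups_train.target_names` plays the same role.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical class names and training data (index 0 = sports, 1 = tech).
target_names = ["sports", "tech"]
train_texts = [
    "the team won the match last night",
    "the striker scored two goals",
    "the coach praised the defense",
    "the new phone has a faster processor",
    "the laptop ships with more memory",
    "the update fixes a software bug",
]
train_labels = [0, 0, 0, 1, 1, 1]

vectorizer = TfidfVectorizer()
clf = LogisticRegression()
clf.fit(vectorizer.fit_transform(train_texts), train_labels)

# New text must go through transform (not fit_transform) before prediction.
new_texts = ["the striker scored in the final match"]
new_tfidf = vectorizer.transform(new_texts)
predicted = clf.predict(new_tfidf)

# Map each predicted integer label back to a readable class name.
for text, idx in zip(new_texts, predicted):
    print(text, "->", target_names[idx])
```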