Code for measuring prediction accuracy with TF-IDF
Date: 2024-06-11 11:09:29 Views: 232
Here is a simple TF-IDF code example for a text classification task.
1. Data preprocessing
The data first needs to be preprocessed, which includes word segmentation and stop-word removal.
```python
import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the stop-word list (one word per line) into a set for fast lookup
stopwords = pd.read_csv('stopwords.txt', index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
stopwords = set(stopwords['stopword'].values)

# Read the dataset; it must contain a 'text' column and a 'label' column
data = pd.read_csv('data.csv', encoding='utf-8')

# Segment each text with jieba, drop stop words, and join the tokens with spaces
data['text'] = data['text'].apply(lambda x: " ".join([word for word in jieba.cut(x) if word not in stopwords]))

# Split into training and test sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['label'], test_size=0.2, random_state=42)
```
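One detail worth checking at this stage: the preprocessing step joins the segmented words with spaces, and `TfidfVectorizer` later re-tokenizes that string with its default `token_pattern`, `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters. Single-character Chinese words would therefore be dropped silently. The sketch below uses only the standard `re` module to show the effect; the sample sentence is made up for illustration, and the looser pattern is just one possible choice.

```python
import re

# Space-joined output, as the preprocessing step would produce it
# (hypothetical sample sentence, already segmented).
segmented = "我 爱 自然 语言 处理"

# Default token_pattern of scikit-learn's TfidfVectorizer:
# tokens of two or more word characters.
default_pattern = r"(?u)\b\w\w+\b"
# A looser pattern that also keeps single-character tokens.
single_char_pattern = r"(?u)\b\w+\b"

default_tokens = re.findall(default_pattern, segmented)
all_tokens = re.findall(single_char_pattern, segmented)

print(default_tokens)  # ['自然', '语言', '处理']
print(all_tokens)      # ['我', '爱', '自然', '语言', '处理']
```

If single-character words matter for the task, passing `token_pattern=r"(?u)\b\w+\b"` to `TfidfVectorizer` should keep them.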
2. Training the model
Next, TF-IDF is used to extract features from the text, and a naive Bayes classifier is trained on those features.
```python
# Feature extraction: fit the vocabulary and IDF weights on the training set only
vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a multinomial naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(alpha=0.01)
clf.fit(X_train_tfidf, y_train)
```
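To make the feature-extraction step less of a black box, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings, as far as the scikit-learn documentation describes them: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation per document. The toy corpus and the helper name `tfidf_rows` are made up for illustration, and tokens are split on whitespace here, so single-character words are kept.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is already segmented and space-joined.
docs = [
    "价格 便宜 质量 好",
    "质量 差 价格 贵",
    "物流 快 质量 好",
]

def tfidf_rows(documents):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    tokenized = [doc.split() for doc in documents]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(docs)
# Show the non-zero weights of the first document.
for term, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(term, round(weight, 3))
```

In this toy corpus, "质量" occurs in every document, so its IDF is the minimum value of 1, while a term that occurs in only one document, such as "便宜", receives a higher weight within the same row.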
3. Testing the model
Finally, the model is evaluated on the test set.
```python
# Predict labels for the test set
y_pred = clf.predict(X_test_tfidf)

# Compute accuracy: the fraction of test samples classified correctly
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```
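Accuracy is simply the fraction of predictions that match the true labels, and on its own it can look better than it is when one class dominates. The sketch below, using only the standard library, computes accuracy by hand and compares it with a majority-class baseline. The helper names and the label lists are hypothetical and only illustrate the calculation.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def majority_baseline(y_train, y_true):
    """Accuracy obtained by always predicting the most frequent training label."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_true, [most_common_label] * len(y_true))

# Hypothetical labels, for illustration only.
y_train_demo = ["pos", "pos", "pos", "neg"]
y_test_demo = ["pos", "neg", "pos", "pos", "neg"]
y_pred_demo = ["pos", "neg", "neg", "pos", "neg"]

model_acc = accuracy(y_test_demo, y_pred_demo)
baseline_acc = majority_baseline(y_train_demo, y_test_demo)
print("model accuracy:", model_acc)        # 0.8
print("baseline accuracy:", baseline_acc)  # 0.6
```

A classifier is only doing useful work if its accuracy is clearly above such a baseline.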
Complete code:
```python
import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Load the stop-word list (one word per line) into a set for fast lookup
stopwords = pd.read_csv('stopwords.txt', index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
stopwords = set(stopwords['stopword'].values)

# Read the dataset; it must contain a 'text' column and a 'label' column
data = pd.read_csv('data.csv', encoding='utf-8')

# Segment each text with jieba, drop stop words, and join the tokens with spaces
data['text'] = data['text'].apply(lambda x: " ".join([word for word in jieba.cut(x) if word not in stopwords]))

# Split into training and test sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['label'], test_size=0.2, random_state=42)

# Feature extraction: fit the vocabulary and IDF weights on the training set only
vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a multinomial naive Bayes classifier
clf = MultinomialNB(alpha=0.01)
clf.fit(X_train_tfidf, y_train)

# Predict labels for the test set
y_pred = clf.predict(X_test_tfidf)

# Compute accuracy: the fraction of test samples classified correctly
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```