Classifying Chinese text with tf-idf in Python, including building a vocabulary
Date: 2024-03-25 16:42:02
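To classify Chinese text using tf-idf features, first segment the text into words and build a vocabulary, then compute the tf-idf values of each document. Here is a simple example: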
```python
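import os

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


# Read the data: one document per file. The label is the part of the
# filename before the first dot, so several files must share a prefix
# for a class to have more than one sample.
def read_data(path):
    X, y = [], []
    for filename in os.listdir(path):
        with open(os.path.join(path, filename), 'r', encoding='utf-8') as f:
            content = f.read()
        X.append(content)
        y.append(filename.split('.')[0])
    return X, y


# Word segmentation with jieba
def tokenize(text):
    return list(jieba.cut(text))


# Build the vocabulary: map every distinct token to a column index
def build_vocab(X):
    vocab = set()
    for text in X:
        words = tokenize(text)
        vocab.update(words)
    # Sort so that the token-to-index mapping is the same on every run
    return {word: idx for idx, word in enumerate(sorted(vocab))}


# Compute tf-idf
def calculate_tfidf(X, vocab):
    vectorizer = TfidfVectorizer(
        vocabulary=vocab,
        tokenizer=tokenize,
        token_pattern=None,  # unused when a custom tokenizer is given
        lowercase=False,     # vocab was built from non-lowercased text
    )
    # Keep the result sparse; MultinomialNB accepts sparse matrices
    return vectorizer.fit_transform(X)


# Load the data
X, y = read_data('data')
# Build the vocabulary
vocab = build_vocab(X)
# Compute tf-idf
X_tfidf = calculate_tfidf(X, vocab)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2)
# Train the model
clf = MultinomialNB()
clf.fit(X_train, y_train)
# Predict
y_pred = clf.predict(X_test)
# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)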
```
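In the code above, read_data reads the documents, tokenize segments the text with jieba, build_vocab builds the vocabulary, and calculate_tfidf computes the tf-idf values using TfidfVectorizer from the sklearn library. Finally, a MultinomialNB model is trained, used for prediction, and scored by accuracy.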
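To make the vocabulary and tf-idf steps more concrete, here is a minimal pure-Python sketch of the computation. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row, which is what TfidfVectorizer does with its default settings. The toy documents are made up and already tokenized, standing in for the output of jieba.cut, and the function tfidf_matrix is a name introduced here for illustration only.

```python
import math
from collections import Counter


def build_vocab(tokenized_docs):
    """Map each distinct token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(tokens)}


def tfidf_matrix(tokenized_docs, vocab):
    """Return L2-normalised tf-idf rows, idf = ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for doc in tokenized_docs:
        row = [0.0] * len(vocab)
        for tok, count in Counter(doc).items():
            if tok in vocab:
                row[vocab[tok]] = count * idf[tok]  # raw term count times idf
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row] if norm else row)
    return rows


# Hypothetical pre-tokenized documents (illustrative data only).
docs = [
    ["足球", "比赛", "精彩"],
    ["股票", "市场", "比赛"],
    ["足球", "足球", "球员"],
]
vocab = build_vocab(docs)
matrix = tfidf_matrix(docs, vocab)

for doc, row in zip(docs, matrix):
    print(doc, [round(v, 3) for v in row])
```

In the first document, 精彩 appears in only one document while 足球 and 比赛 each appear in two, so 精彩 gets the largest weight in that row. This is the effect the idf term is meant to produce: words that are rare across the corpus count for more than words that are common.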