Write code to classify Chinese text in txt files using TF-IDF + KNN
Date: 2023-12-14 22:36:59
Below is a sample implementation that classifies Chinese text stored as txt files using TF-IDF features and a KNN classifier:
```python
import os

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the dataset: each subdirectory of `path` is one class,
# and each txt file inside it is one document
def load_dataset(path):
    dataset = []
    labels = []
    for root, dirs, files in os.walk(path):
        for file in files:
            label = os.path.basename(root)  # class = subdirectory name
            labels.append(label)
            file_path = os.path.join(root, file)
            with open(file_path, 'r', encoding='utf-8') as f:
                dataset.append(f.read())
    return dataset, labels

# Chinese word segmentation: TfidfVectorizer's `tokenizer` must return
# a list of tokens, so use jieba.lcut rather than a space-joined string
def chinese_word_cut(text):
    return jieba.lcut(text)

# Compute the TF-IDF matrix (kept sparse; KNN handles sparse input)
def tf_idf(dataset):
    vectorizer = TfidfVectorizer(tokenizer=chinese_word_cut,
                                 token_pattern=None,
                                 max_features=5000)
    X = vectorizer.fit_transform(dataset)
    return X, vectorizer

# KNN classification
def knn_classifier(X_train, y_train, X_test, k=3):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    return knn.predict(X_test)

if __name__ == "__main__":
    dataset, labels = load_dataset("data")
    X, vectorizer = tf_idf(dataset)
    # Shuffled train/test split: os.walk yields files grouped by class,
    # so a plain 70/30 slice could leave whole classes out of one split
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.3, random_state=42)
    # KNN classification
    y_predict = knn_classifier(X_train, y_train, X_test)
    print("KNN predictions:", y_predict)
```
The code uses the jieba library to segment the Chinese text, sklearn's TfidfVectorizer to compute TF-IDF features, and sklearn's KNeighborsClassifier to perform KNN classification. In the main block, load_dataset reads the corpus, tf_idf computes the TF-IDF matrix, the data is split into a training set and a test set, and finally knn_classifier runs the classification and the predicted labels are printed.
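The pipeline can be tried end to end without a data directory. Below is a minimal sketch on a tiny made-up corpus: the documents are already space-separated, as if segmented by jieba, so it runs without the jieba dependency; the texts and the "finance"/"sports" labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Pre-segmented training documents (two classes: finance vs. sports)
train_docs = [
    "股票 市场 上涨 投资",
    "基金 收益 投资 理财",
    "球队 比赛 进球 胜利",
    "球员 训练 比赛 冠军",
]
train_labels = ["finance", "finance", "sports", "sports"]

# Fit TF-IDF on the training corpus, then transform the test document
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(["投资 股票 收益"])

# 3-nearest-neighbor classifier, as in the main code
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, train_labels)
print(knn.predict(X_test))  # the test text shares terms only with finance docs
```

Because the test document shares no terms with the sports documents, both finance documents are among its three nearest neighbors and the majority vote yields "finance".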