Implementing TF-IDF Chinese text feature extraction in Python, and reading a test-set file
The following example uses the TfidfVectorizer class from Python's sklearn library to extract TF-IDF features from Chinese text, and shows how to read a test-set file:
```python
import os
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the full content of a single file
def read_file(path):
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()

# Collect the paths of all files under a folder, recursively
def get_files_path(folder_path):
    files_path = []
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            files_path.append(os.path.join(root, file))
    return files_path

# Tokenize Chinese text with jieba
def tokenizer(text):
    return jieba.lcut(text)

# Build the TF-IDF feature matrix for the training corpus; return the
# fitted vectorizer as well, so the test set can reuse its vocabulary
def feature_extraction(files_path):
    corpus = [read_file(path) for path in files_path]
    vectorizer = TfidfVectorizer(tokenizer=tokenizer)
    X = vectorizer.fit_transform(corpus)
    return vectorizer, X

# Read the test-set file: one document per line
def read_test_file(path):
    with open(path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f.readlines()]

# Example usage
train_folder_path = 'path/to/train/folder'  # path to the training-set folder
test_file_path = 'path/to/test/file'        # path to the test-set file
train_files_path = get_files_path(train_folder_path)
vectorizer, X_train = feature_extraction(train_files_path)
test_content_list = read_test_file(test_file_path)
X_test = vectorizer.transform(test_content_list)  # transform only; do not refit
print(X_train.toarray())
print(X_test.toarray())
```
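Note the design choice in `feature_extraction`: it returns the fitted vectorizer together with the training matrix, and the test set is then converted with `transform` rather than `fit_transform`. This keeps the test features in the same vocabulary and IDF weighting learned from the training corpus; refitting on the test data would produce a feature space with different dimensions.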
In the code above, we use the os module (via os.walk) to collect and read the training files, tokenize the text with jieba, and extract features with the TfidfVectorizer class. We also implement a function that reads the test-set file line by line and transform the test documents with the fitted vectorizer. Finally, the feature matrices of the training and test sets are printed.
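If you want to verify the tokenizer and vectorizer without preparing files on disk, a minimal sketch like the following works on an inline corpus. The example sentences here are made up; `get_feature_names_out` assumes scikit-learn >= 1.0, and `token_pattern=None` merely silences the warning sklearn emits when a custom tokenizer is supplied:
```python
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: three short made-up Chinese sentences
corpus = ["我爱自然语言处理", "机器学习很有趣", "自然语言处理和机器学习都很有趣"]

vectorizer = TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None)
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # one TF-IDF row per document
```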