
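Using jieba in Python to segment every document in a folder, remove stop words according to a stop word list, and generate each document's word vector from its non-stop words and their TF-IDF values, with code...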

Time: 2024-02-24 11:53:16 Views: 95

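Sure, here is one way to implement it:

```python
import os

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the stop word list (one word per line)
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = {line.strip() for line in f if line.strip()}


def cut_words(file_path):
    """Segment one file with jieba and drop stop words and whitespace tokens."""
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
    words = (w.strip() for w in jieba.cut(content))
    return ' '.join(w for w in words if w and w not in stopwords)


def get_file_path(root_path):
    """Collect the paths of all files under root_path, recursively."""
    file_list = []
    for root, dirs, files in os.walk(root_path):
        for file in files:
            file_list.append(os.path.join(root, file))
    return file_list


def generate_vectors(file_path_list):
    """Build one TF-IDF vector per document."""
    corpus = [cut_words(file_path) for file_path in file_path_list]
    # The default token_pattern drops single-character tokens, which would
    # discard many Chinese words; this pattern keeps them.
    vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
    vectors = vectorizer.fit_transform(corpus)
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is its replacement (available since 1.0).
    return vectors, vectorizer.get_feature_names_out()


# Example usage
root_path = 'test'
file_path_list = get_file_path(root_path)
vectors, feature_names = generate_vectors(file_path_list)
print(vectors.toarray())
print(feature_names)
```

Here, `stopwords.txt` stores the stop words, one per line. The `cut_words` function segments a file and removes stop words, `get_file_path` collects the paths of all files under the folder, and `generate_vectors` builds the word vectors. `vectors.toarray()` returns a dense matrix with one row per document, in the same order as `file_path_list`, and `feature_names` lists the non-stop words used as features, one per column.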
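To make the TF-IDF values less of a black box, the following is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies under its default settings (`smooth_idf=True`, `norm='l2'`): raw term counts multiplied by `ln((1 + n) / (1 + df)) + 1`, then each row scaled to unit length. The function name `tfidf_vectors` and the toy token lists are hypothetical, introduced only for illustration; they stand in for the output of jieba segmentation after stop word removal.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Compute L2-normalized TF-IDF vectors from lists of tokens.

    Hypothetical helper: mirrors the smoothed IDF formula documented for
    scikit-learn's defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1.
    """
    n_docs = len(tokenized_docs)
    # Fixed column order: every distinct token, sorted
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    # Document frequency: number of documents that contain each token
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocabulary}

    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term counts in this document
        raw = [tf[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw))
        vectors.append([w / norm if norm else 0.0 for w in raw])
    return vocabulary, vectors


# Toy corpus: each inner list plays the role of one segmented document
docs = [
    ["机器", "学习", "模型"],
    ["深度", "学习", "模型", "模型"],
    ["分词", "工具"],
]
vocab, vecs = tfidf_vectors(docs)
print(vocab)
for vec in vecs:
    print([round(w, 3) for w in vec])
```

Within a single document, a word that appears in fewer documents receives a larger weight than an equally frequent word that appears in many, which is why rare, topic-specific words dominate a document's vector.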
