Using jieba in Python to tokenize every document in a folder, remove stop words according to a stop-word list, and generate each document's word vector from its non-stopword terms and their TF-IDF values — concrete code...
Using the Python programming language: tokenize the documents, remove stop words and punctuation, build the documents' dictionary, and then use the dictionary and the document contents to build an inverted index (postings list) of the terms.
OK, here is the code implementation:
```python
import os
import jieba
import math

# Read the stop-word list and return it as a set for fast lookup
def get_stopwords(file_path):
    stopwords = set()
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            stopwords.add(line.strip())
    return stopwords

# Compute the TF-IDF value of one word in one document.
# words_list: the tokens of the current document
# all_docs:   the token lists of all documents (used for the document frequency)
def get_tf_idf(word, words_list, all_docs):
    tf = words_list.count(word) / len(words_list)
    df = sum(1 for doc in all_docs if word in doc)  # number of documents containing the word
    idf = math.log(len(all_docs) / df)
    return tf * idf

# Tokenize one document with jieba and return its token list
def cut_words(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    return jieba.lcut(text)

# Build the word vector of one document: (word, TF-IDF) pairs for its non-stopword terms
def get_doc_vector(words_list, stopwords, all_docs):
    vector = []
    for word in set(words_list):                    # each distinct term appears once in the vector
        if word not in stopwords and word.strip():  # drop stopwords and pure-whitespace tokens
            vector.append((word, get_tf_idf(word, words_list, all_docs)))
    return vector

# Walk the folder and generate the word vector of every document
def get_all_vectors(folder_path, stopwords_path):
    stopwords = get_stopwords(stopwords_path)
    # First pass: tokenize every document so document frequencies can be computed
    all_docs = []
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            all_docs.append(cut_words(os.path.join(root, file)))
    # Second pass: build the TF-IDF vector of each document
    return [get_doc_vector(words_list, stopwords, all_docs) for words_list in all_docs]
```
Usage:
```python
folder_path = 'your_folder_path'
stopwords_path = 'your_stopwords_path'
all_vectors = get_all_vectors(folder_path, stopwords_path)
```
Here, `folder_path` is the path of the folder containing the documents to tokenize, and `stopwords_path` is the path of the stop-word list. `all_vectors` is the list of word vectors for all documents; each document's word vector is a list of 2-tuples of the form `[('word1', tfidf1), ('word2', tfidf2), ...]`.
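For example, to look at the highest-weighted terms of one document, its vector can be sorted by the TF-IDF values (a small illustrative snippet; the document index `0` and the cutoff of 10 are arbitrary choices, not part of the code above):

```python
# Sort the first document's vector by TF-IDF, highest first, and print the top 10 terms
top_terms = sorted(all_vectors[0], key=lambda item: item[1], reverse=True)
for word, weight in top_terms[:10]:
    print(word, round(weight, 4))
```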
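The question also asks for a term dictionary and an inverted index (postings list). Below is a minimal sketch of that step, reusing the `cut_words` and `get_stopwords` helpers above; the document IDs are simply the positions of the files in `os.walk` order, which is an assumption of this sketch rather than something fixed by the question:

```python
def build_inverted_index(folder_path, stopwords_path):
    """Build {term: sorted list of document IDs that contain the term}."""
    stopwords = get_stopwords(stopwords_path)
    inverted_index = {}
    doc_id = 0
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            words = cut_words(os.path.join(root, file))
            for word in set(words):                       # count each document at most once per term
                if word not in stopwords and word.strip():
                    inverted_index.setdefault(word, []).append(doc_id)
            doc_id += 1
    # The dictionary is the sorted set of terms; each postings list is kept sorted by document ID
    return {term: sorted(postings) for term, postings in sorted(inverted_index.items())}
```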