How can I write Python code that segments text with jieba, loads a keyword text file as jieba's custom dictionary via jieba.load_userdict, counts the frequency of each keyword from that file in every document in a folder, filters out stopwords (the stopword file is D), and generates an Excel file whose column headers are the keywords from the keyword file and whose row headers are the names of the documents in the folder?
First, import the jieba and pandas libraries (both are third-party packages; install them with pip if they are missing):
```python
import jieba
import pandas as pd
```
Then, read the stopword file D and store the stopwords in a set (a set gives fast membership tests when filtering):
```python
stopwords = set()
with open('D.txt', 'r', encoding='utf-8') as f:
    for line in f:
        stopwords.add(line.strip())  # one stopword per line
```
Next, load the keyword file as a custom dictionary so that jieba keeps the keywords intact during segmentation:
```python
jieba.load_userdict('keywords.txt')
```
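jieba.load_userdict expects one entry per line, in the form `word [freq] [POS]`, where the frequency and part-of-speech tag are optional. As a minimal sketch, a keywords.txt in that format could be created like this (the entries are made-up examples, not from the original question):
```python
# Hypothetical illustration only: write a keywords.txt in the format
# that jieba.load_userdict accepts ("word [freq] [POS]", one per line;
# freq and POS are optional).
sample_entries = "深度学习\n知识图谱 10 n\n词向量\n"
with open('keywords.txt', 'w', encoding='utf-8') as f:
    f.write(sample_entries)
```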
Then, define a function that counts the frequency of each keyword in a piece of text (it uses the stopwords set defined above):
```python
def count_words(text, keywords):
    """Count how often each keyword occurs in text, skipping stopwords."""
    word_count = {}
    words = jieba.lcut(text)  # segment the text into a list of words
    for word in words:
        if word in keywords and word not in stopwords:
            word_count[word] = word_count.get(word, 0) + 1
    return word_count
```
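Here is a quick sanity check of count_words on an in-memory string; the keywords below are hypothetical, and the expected counts assume jieba keeps these words whole (which the custom dictionary loaded above is meant to ensure):
```python
# Hypothetical usage example; 'stopwords' is the set loaded earlier.
sample_keywords = {'深度学习', '词向量'}
sample_text = '深度学习依赖词向量,深度学习的效果很好'
print(count_words(sample_text, sample_keywords))
# expected: {'深度学习': 2, '词向量': 1}
```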
Finally, iterate over the text files in the folder, call count_words on each one, and collect the per-file results in a dictionary. Then use pandas to turn the dictionary into a DataFrame and save it as an Excel file (writing .xlsx requires the openpyxl package):
```python
import os

folder_path = 'folder_path'      # folder containing the text files to analyze
keywords_file = 'keywords.txt'

# Read the keyword file into a set for fast membership tests
keywords = set()
with open(keywords_file, 'r', encoding='utf-8') as f:
    for line in f:
        keywords.add(line.strip())

# Load the keyword file as jieba's custom dictionary
jieba.load_userdict(keywords_file)

# Count keyword frequencies in each text file
result = {}
for filename in os.listdir(folder_path):
    if filename.endswith('.txt'):
        file_path = os.path.join(folder_path, filename)
        with open(file_path, 'r', encoding='utf-8') as f:
            text = f.read()
        result[filename] = count_words(text, keywords)

# pd.DataFrame(result) puts the file names in the columns, so transpose (.T)
# to get file names as row labels and keywords as column labels
df = pd.DataFrame(result).fillna(0).astype(int).T
df.to_excel('result.xlsx')
```
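As an optional check, the spreadsheet can be read back to confirm its shape (this assumes result.xlsx was written to the current directory):
```python
# Rows should be the document names, columns the keywords
check = pd.read_excel('result.xlsx', index_col=0)
print(check.shape)
print(check.head())
```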
That covers the full workflow: segment each text with jieba, count keyword frequencies against the keyword file while filtering out stopwords, and export the resulting matrix to Excel.