Loading multiple stopword lists and removing stopwords from all txt files in multiple folders
Posted: 2023-08-05 15:09:53
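You can load multiple stopword lists and remove stopwords from every txt file in several folders with the following steps:
1. Install the required library: jieba (`os` is part of the Python standard library and needs no installation)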
```python
# Notebook syntax; from a regular shell, run `pip install jieba` instead.
!pip install jieba
```
2. Load the stopword lists
```python
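def load_stopwords(stopwords_path_list):
    """Merge several stopword files (one word per line) into a single set."""
    stopwords = set()
    for stopwords_path in stopwords_path_list:
        with open(stopwords_path, 'r', encoding='UTF-8') as f:
            for line in f:
                word = line.strip()
                if word:  # skip blank lines
                    stopwords.add(word)
    return stopwords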
```
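Here, `stopwords_path_list` is a list of paths to the stopword files. The words from all files are merged into one set, so a word that appears in several lists is stored only once.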
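As a small illustration of the merging step, the sketch below builds two throwaway stopword files and unions them. The helper name `merge_stopword_files` and the sample words are made up for this example. It also assumes a variation the original does not use: opening files with `encoding='utf-8-sig'`, which strips a leading byte-order mark if one is present, so the first word of a file saved with a BOM still matches.

```python
import os
import tempfile

def merge_stopword_files(paths):
    """Hypothetical helper: union of all non-empty lines across the given files."""
    merged = set()
    for path in paths:
        # 'utf-8-sig' reads plain UTF-8 too, and drops a leading BOM if present.
        with open(path, 'r', encoding='utf-8-sig') as f:
            merged.update(line.strip() for line in f if line.strip())
    return merged

def demo_merge():
    # Two temporary lists that share one word; the first is saved with a BOM.
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, 'stopwords1.txt')
        second = os.path.join(tmp, 'stopwords2.txt')
        with open(first, 'w', encoding='utf-8-sig') as f:
            f.write('的\n了\n\n')
        with open(second, 'w', encoding='UTF-8') as f:
            f.write('的\n和\n')
        return merge_stopword_files([first, second])

if __name__ == '__main__':
    print(sorted(demo_merge()))
```

Because the result is a set, membership checks during filtering stay fast no matter how many lists are merged.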
3. Remove stopwords from a single file
```python
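import jieba

def remove_stopwords(file_path, stopwords):
    """Segment the file's text with jieba and drop every token found in stopwords."""
    with open(file_path, 'r', encoding='UTF-8') as f:
        text = f.read()
    words = jieba.cut(text)
    # Keep only the tokens that are not stopwords, joined by single spaces.
    result = [word for word in words if word not in stopwords]
    return ' '.join(result)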
```
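Here, `file_path` is the path of the file to process and `stopwords` is the set of loaded stopwords. The function returns the remaining tokens joined by spaces.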
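The filtering step can also be looked at on its own, without a segmenter. The sketch below is only an illustration: `filter_tokens` is a hypothetical helper, and the token list is written by hand to resemble segmenter output, on the assumption that whitespace in the text comes through as separate tokens. Unlike the function above, it also drops whitespace-only tokens so the joined result has no doubled spaces.

```python
def filter_tokens(tokens, stopwords):
    """Hypothetical helper: drop stopwords and whitespace-only tokens, then join."""
    kept = [
        token
        for token in tokens
        if token.strip() and token not in stopwords
    ]
    return ' '.join(kept)

if __name__ == '__main__':
    # Hand-written tokens standing in for segmenter output (an assumption).
    sample_tokens = ['我', ' ', '的', '书', '\n', '和', '笔']
    print(filter_tokens(sample_tokens, {'的', '和'}))
```

Note that this flattens line breaks as well; keep the newline tokens if the line structure of the file matters.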
4. Remove stopwords from all txt files in a folder
```python
import os

def remove_stopwords_folder(folder_path, stopwords):
    # os.walk also descends into the subfolders of folder_path.
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.endswith('.txt'):
                file_path = os.path.join(root, file)
                result = remove_stopwords(file_path, stopwords)
                # Overwrite the original file with the filtered text.
                with open(file_path, 'w', encoding='UTF-8') as f:
                    f.write(result)
```
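Here, `folder_path` is the path of the folder to process. Note that `os.walk` also visits subfolders, and each txt file is overwritten in place, so keep a backup if you need the original text.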
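If overwriting the source files is not acceptable, one option is to write the results to a separate output folder that mirrors the input layout. The sketch below is a possible variant, not part of the original answer: `process_txt_tree`, `dst_dir`, and the `transform` callback are hypothetical names, and the output folder is assumed to lie outside the source folder. The demo uses a simple string replacement as a stand-in for the stopword filter.

```python
import os
import tempfile

def process_txt_tree(src_dir, dst_dir, transform):
    """Hypothetical helper: apply transform(text) to every .txt file under
    src_dir and write the result to the same relative path under dst_dir."""
    written = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            if not name.endswith('.txt'):
                continue
            src_path = os.path.join(root, name)
            rel_path = os.path.relpath(src_path, src_dir)
            dst_path = os.path.join(dst_dir, rel_path)
            # Recreate the subfolder structure on the output side.
            os.makedirs(os.path.dirname(dst_path), exist_ok=True)
            with open(src_path, 'r', encoding='UTF-8') as f:
                text = f.read()
            with open(dst_path, 'w', encoding='UTF-8') as f:
                f.write(transform(text))
            written.append(rel_path)
    return written

def demo_copy():
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, 'src')
        dst = os.path.join(tmp, 'dst')
        os.makedirs(os.path.join(src, 'sub'))
        for rel in ('a.txt', os.path.join('sub', 'b.txt'), 'skip.md'):
            with open(os.path.join(src, rel), 'w', encoding='UTF-8') as f:
                f.write('的 书')
        # Stand-in transform; a real run would pass the stopword filter here.
        written = process_txt_tree(
            src, dst, lambda text: text.replace('的', '').strip()
        )
        with open(os.path.join(src, 'a.txt'), encoding='UTF-8') as f:
            original = f.read()
        with open(os.path.join(dst, 'sub', 'b.txt'), encoding='UTF-8') as f:
            filtered = f.read()
        return sorted(written), original, filtered

if __name__ == '__main__':
    print(demo_copy())
```

The source files stay untouched, so the run can be repeated with a different set of stopword lists.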
5. Remove stopwords from all txt files in each of several folders
```python
def remove_stopwords_folders(folders_path_list, stopwords):
    # Process each folder in turn with the same stopword set.
    for folder_path in folders_path_list:
        remove_stopwords_folder(folder_path, stopwords)
```
Here, `folders_path_list` is a list of folder paths.
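One thing to watch for: `os.walk` yields nothing for a path that does not exist, so a mistyped folder path is skipped without any error. A quick check before the real run can catch this. The sketch below is an optional addition; `count_txt_files` is a hypothetical helper that reports, per folder, how many txt files would be processed, or `None` when the path is not an existing directory.

```python
import os
import tempfile

def count_txt_files(folder_paths):
    """Hypothetical pre-flight check: map each folder to its number of .txt
    files, or to None when the path is not an existing directory."""
    report = {}
    for folder in folder_paths:
        if not os.path.isdir(folder):
            report[folder] = None
            continue
        report[folder] = sum(
            1
            for _root, _dirs, files in os.walk(folder)
            for name in files
            if name.endswith('.txt')
        )
    return report

def demo_count():
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, 'folder1')
        os.makedirs(os.path.join(good, 'sub'))
        for rel in ('a.txt', os.path.join('sub', 'b.txt'), 'notes.md'):
            with open(os.path.join(good, rel), 'w', encoding='UTF-8') as f:
                f.write('x')
        missing = os.path.join(tmp, 'folder2')  # never created
        report = count_txt_files([good, missing])
        return report[good], report[missing]

if __name__ == '__main__':
    print(demo_count())
```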
Finally, call the functions above as follows to load the stopword lists and process every txt file in each folder:
```python
stopwords_path_list = ['stopwords1.txt', 'stopwords2.txt']
stopwords = load_stopwords(stopwords_path_list)
folders_path_list = ['folder1 path', 'folder2 path']
remove_stopwords_folders(folders_path_list, stopwords)
```
Here, `stopwords_path_list` is the list of stopword file paths, and `folders_path_list` is the list of folders to process.