Loading multiple stopword lists and removing stopwords from every txt file in a folder
Date: 2023-08-05 22:09:53
You can load several stopword lists and remove stopwords from every txt file in a folder with the following steps:
1. Install the required library. Only jieba needs installing; os is part of Python's standard library.
```python
# Jupyter notebook syntax; from a terminal, run: pip install jieba
!pip install jieba
```
2. Load the stopword lists
```python
def load_stopwords(stopwords_path_list):
    # Merge every stopword file (one word per line) into a single set
    stopwords = set()
    for stopwords_path in stopwords_path_list:
        with open(stopwords_path, 'r', encoding='UTF-8') as f:
            for line in f:
                word = line.strip()
                if word:  # skip blank lines
                    stopwords.add(word)
    return stopwords
```
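Here, `stopwords_path_list` is a list of paths to the stopword files. Because the words are collected in a set, entries that appear in more than one list are stored only once.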
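As a quick check that several lists really are merged and deduplicated, here is a minimal sketch that writes two small temporary stopword files and loads them together. The file contents are made up for illustration, and reading with `utf-8-sig` (which also tolerates a byte-order mark at the start of a file) is an assumption added here, not part of the original answer.

```python
import os
import tempfile


def load_stopwords(stopwords_path_list):
    """Merge several stopword files (one word per line) into one set."""
    stopwords = set()
    for stopwords_path in stopwords_path_list:
        # utf-8-sig reads plain UTF-8 and also strips a leading BOM if present
        with open(stopwords_path, 'r', encoding='utf-8-sig') as f:
            for line in f:
                word = line.strip()
                if word:  # ignore blank lines
                    stopwords.add(word)
    return stopwords


def demo_merge():
    """Write two hypothetical, overlapping lists and load them together."""
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, 'stopwords1.txt')
        second = os.path.join(tmp, 'stopwords2.txt')
        with open(first, 'w', encoding='utf-8') as f:
            f.write('的\n了\n\n')
        with open(second, 'w', encoding='utf-8') as f:
            f.write('了\n是\n')
        return load_stopwords([first, second])


merged = demo_merge()
print(sorted(merged))
```

The word that appears in both files ends up in the set once, and the blank line in the first file is skipped.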
3. Remove stopwords from a single file
```python
import jieba

def remove_stopwords(file_path, stopwords):
    with open(file_path, 'r', encoding='UTF-8') as f:
        text = f.read()
    # jieba.cut returns a generator of segmented tokens
    words = jieba.cut(text)
    # Keep only the tokens that are not in the stopword set
    result = [word for word in words if word not in stopwords]
    return ' '.join(result)
```
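Here, `file_path` is the path of the file to clean and `stopwords` is the set of stopwords loaded earlier. The function returns the remaining tokens joined by single spaces.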
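One detail worth watching: a segmenter typically passes whitespace and newline characters through as tokens of their own, and these are not in the stopword set, so they can end up in the joined output as extra spaces. The sketch below shows one possible way to drop them along with the stopwords. `filter_tokens` is a hypothetical helper, and the token list is a hand-written stand-in for segmenter output so the example runs without jieba installed.

```python
def filter_tokens(tokens, stopwords):
    """Keep tokens that are neither stopwords nor pure whitespace."""
    kept = []
    for token in tokens:
        stripped = token.strip()
        # An empty string after stripping means the token was only whitespace
        if stripped and stripped not in stopwords:
            kept.append(stripped)
    return kept


# Hypothetical token stream, standing in for the output of a segmenter
tokens = ['我', '的', ' ', '书', '\n', '是', '新', '的']
stopwords = {'的', '是'}

filtered = filter_tokens(tokens, stopwords)
print(' '.join(filtered))
```

In `remove_stopwords`, the same idea would amount to passing the result of `jieba.cut(text)` through such a filter before joining.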
4. Remove stopwords from every txt file in a folder
```python
import os

def remove_stopwords_folder(folder_path, stopwords):
    # os.walk also descends into subfolders
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.endswith('.txt'):
                file_path = os.path.join(root, file)
                result = remove_stopwords(file_path, stopwords)
                # Note: this overwrites the original file in place
                with open(file_path, 'w', encoding='UTF-8') as f:
                    f.write(result)
```
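Here, `folder_path` is the folder whose txt files should be cleaned. Subfolders are included, and each file is overwritten with its cleaned text, so keep a backup if you still need the originals.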
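Since the folder function above writes its result back to the same path, the original text is lost. If that is a concern, one alternative is to write the cleaned files into a separate output folder that mirrors the source layout. The following is a minimal sketch of that idea, not the original answer's method: `process_folder`, `dst_folder`, and the `transform` callable are hypothetical names, and the demo uses a trivial stand-in transform so it runs without jieba. It assumes the output folder is not located inside the source folder.

```python
import os
import tempfile


def process_folder(src_folder, dst_folder, transform):
    """Apply transform(text) to every .txt under src_folder, writing to dst_folder."""
    written = []
    for root, dirs, files in os.walk(src_folder):
        for name in files:
            if not name.endswith('.txt'):
                continue
            src_path = os.path.join(root, name)
            # Rebuild the same relative path under the output folder
            rel_path = os.path.relpath(src_path, src_folder)
            dst_path = os.path.join(dst_folder, rel_path)
            os.makedirs(os.path.dirname(dst_path), exist_ok=True)
            with open(src_path, 'r', encoding='UTF-8') as f:
                text = f.read()
            with open(dst_path, 'w', encoding='UTF-8') as f:
                f.write(transform(text))
            written.append(dst_path)
    return written


def drop_the(text):
    """Stand-in transform: remove the word 'the' from space-separated text."""
    return ' '.join(w for w in text.split() if w != 'the')


def demo_process():
    """Build a small temporary tree, process it, and read back both versions."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, 'raw')
        dst = os.path.join(tmp, 'clean')
        os.makedirs(os.path.join(src, 'sub'))
        with open(os.path.join(src, 'sub', 'a.txt'), 'w', encoding='UTF-8') as f:
            f.write('hello the world')
        with open(os.path.join(src, 'skip.md'), 'w', encoding='UTF-8') as f:
            f.write('not a txt file')
        written = process_folder(src, dst, drop_the)
        with open(os.path.join(dst, 'sub', 'a.txt'), encoding='UTF-8') as f:
            cleaned = f.read()
        with open(os.path.join(src, 'sub', 'a.txt'), encoding='UTF-8') as f:
            original = f.read()
        return len(written), cleaned, original


count, cleaned, original = demo_process()
print(count, repr(cleaned), repr(original))
```

To combine this with the steps above, `transform` could be a function that segments the text with jieba and filters it against the loaded stopword set.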
Finally, call the functions above as follows to load the stopword lists and clean every txt file in the folder:
```python
stopwords_path_list = ['stopwords1.txt', 'stopwords2.txt']
stopwords = load_stopwords(stopwords_path_list)
folder_path = 'your folder path'
remove_stopwords_folder(folder_path, stopwords)
```
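Here, `stopwords_path_list` is the list of stopword file paths and `folder_path` is the folder to clean. Keep the stopword files outside `folder_path`; otherwise they are txt files too and will be processed along with the rest.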
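When there are many stopword files, listing each path by hand becomes tedious. A possible shortcut, assuming all lists sit in one dedicated directory, is to build `stopwords_path_list` with the standard `glob` module. `find_stopword_lists` and the directory layout in the demo are hypothetical.

```python
import glob
import os
import tempfile


def find_stopword_lists(stopwords_dir, pattern='*.txt'):
    """Return the sorted paths of files in stopwords_dir that match pattern."""
    return sorted(glob.glob(os.path.join(stopwords_dir, pattern)))


def demo_find():
    """Create a temporary directory with two lists and one unrelated file."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ('stopwords2.txt', 'stopwords1.txt', 'notes.md'):
            with open(os.path.join(tmp, name), 'w', encoding='UTF-8') as f:
                f.write('')
        paths = find_stopword_lists(tmp)
        # Return only the file names so the result does not depend on tmp
        return [os.path.basename(p) for p in paths]


found = demo_find()
print(found)
```

The returned list can be passed directly to `load_stopwords` in place of a hand-written `stopwords_path_list`.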