Tokenization, stop-word filtering, and word-frequency counting for the web text journey_to_the_west.txt with Python
Sure, I can help you use Python to tokenize Journey to the West, filter out stop words, and count word frequencies. You will first need a stop-word list.
The steps are as follows:
1. Download a stop-word list. You can take one from an NLP toolkit such as NLTK, or find a Chinese stop-word list on GitHub. Here I use a list hosted on GitHub.
```python
import urllib.request
import os

# Download the stop-word list only if it is not already present locally
if not os.path.exists('stopwords.txt'):
    print('Downloading stopwords...')
    url = 'https://raw.githubusercontent.com/goto456/stopwords/master/stopwords.txt'
    urllib.request.urlretrieve(url, 'stopwords.txt')
    print('Stopwords download complete.')
```
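If that URL is unreachable (the repository layout may have changed since this answer was written), any UTF-8 stop-word file with one word per line will do. A minimal sketch of guarding the download, assuming a local fallback file named `my_stopwords.txt` (a hypothetical name) is available:
```python
import os
import shutil
import urllib.error
import urllib.request

URL = 'https://raw.githubusercontent.com/goto456/stopwords/master/stopwords.txt'

if not os.path.exists('stopwords.txt'):
    try:
        urllib.request.urlretrieve(URL, 'stopwords.txt')
    except urllib.error.URLError:
        # Fall back to a local copy if the download fails
        # ('my_stopwords.txt' is a hypothetical placeholder file)
        shutil.copy('my_stopwords.txt', 'stopwords.txt')
```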
2. Read the Journey to the West text
```python
with open('journey_to_the_west.txt', 'r', encoding='utf-8') as f:
text = f.read()
```
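Note that many Chinese plain-text e-books are saved as GBK/GB18030 rather than UTF-8. If the `open` call above raises a `UnicodeDecodeError`, one simple workaround is to fall back to the other encoding:
```python
# Try UTF-8 first, then fall back to GB18030 (a superset of GBK)
try:
    with open('journey_to_the_west.txt', 'r', encoding='utf-8') as f:
        text = f.read()
except UnicodeDecodeError:
    with open('journey_to_the_west.txt', 'r', encoding='gb18030') as f:
        text = f.read()
```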
3. Tokenize the text, here using the jieba library
```python
import jieba
words = jieba.lcut(text)
```
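One optional refinement, not part of the original steps: jieba may split multi-character names into smaller pieces. `jieba.add_word` can register such names before cutting so they are kept as single tokens; the names below are only examples.
```python
import jieba

# Register character names so they are not split into smaller tokens
# (this list is illustrative, not exhaustive)
for name in ['孙悟空', '猪八戒', '沙和尚', '唐三藏']:
    jieba.add_word(name)

words = jieba.lcut(text)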
4. Filter out stop words
```python
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = f.read().split('\n')

# Keep only tokens that are neither stop words nor bare newlines
filtered_words = []
for word in words:
    if word not in stopwords and word != '\n':
        filtered_words.append(word)
```
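Two optional tweaks, sketched below: storing the stop words in a set makes each membership test O(1) instead of scanning a list for every token, and `str.strip()` also catches spaces and tabs rather than only the `'\n'` token.
```python
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    # A set gives constant-time "word in stopwords" checks
    stopwords = set(line.strip() for line in f)

# Drop stop words and whitespace-only tokens in one pass
filtered_words = [w for w in words if w.strip() and w not in stopwords]
```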
5. Count word frequencies
```python
from collections import Counter
word_count = Counter(filtered_words)
print(word_count.most_common(20))
```
The complete code is as follows:
```python
import urllib.request
import os
import jieba
from collections import Counter

# 1. Download the stop-word list if it is not already present
if not os.path.exists('stopwords.txt'):
    print('Downloading stopwords...')
    url = 'https://raw.githubusercontent.com/goto456/stopwords/master/stopwords.txt'
    urllib.request.urlretrieve(url, 'stopwords.txt')
    print('Stopwords download complete.')

# 2. Read the novel text
with open('journey_to_the_west.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# 3. Tokenize with jieba
words = jieba.lcut(text)

# 4. Filter out stop words and newline tokens
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = f.read().split('\n')

filtered_words = []
for word in words:
    if word not in stopwords and word != '\n':
        filtered_words.append(word)

# 5. Count word frequencies and print the 20 most common
word_count = Counter(filtered_words)
print(word_count.most_common(20))
```
This code prints the 20 most frequent words together with their counts.
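One caveat: with a classical-style text like this, the top of the list may still be dominated by single-character function words that the stop-word list misses. A simple optional refinement, appended to the script above, is to count only tokens of two or more characters:
```python
from collections import Counter

# Optionally keep only tokens of two or more characters before counting,
# which drops leftover single-character function words
long_words = [w for w in filtered_words if len(w) >= 2]
print(Counter(long_words).most_common(20))
```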