Python word frequency count for a txt file
Date: 2024-11-04 08:06:19
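In Python, you can count word frequencies in a text file (such as a .txt file) with the standard-library `collections` module or with a third-party library such as `nltk` (Natural Language Toolkit). The basic steps for both approaches are:
**Using the collections module**: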
```python
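from collections import Counter
import re


def word_frequency(file_path):
    """Return a Counter mapping each lowercase word in the file to its count."""
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    # Lowercase the text, then extract runs of word characters;
    # punctuation and line breaks are dropped because they never match \w.
    # Note: \w+ does not segment Chinese, so unspaced text comes back as long runs.
    words = re.findall(r'\b\w+\b', text.lower())
    # Use Counter to tally how often each word occurs
    return Counter(words)


# Example
word_counts = word_frequency('your_text_file.txt')
print(word_counts.most_common(10))  # the ten most frequent words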
```
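To inspect the result without a file on disk, the same regex can be applied to a string directly. The following is a minimal sketch; `count_words`, `top_words`, and the sample sentence are hypothetical names and data chosen for illustration, not part of the original answer.

```python
from collections import Counter
import re


def count_words(text):
    """Count lowercase word tokens in a string, using the same regex as the file-based function."""
    return Counter(re.findall(r'\b\w+\b', text.lower()))


def top_words(counts, n=3):
    """Return the n most frequent (word, count) pairs; ties are broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


# Hypothetical sample text
sample = "The cat sat on the mat. The mat was flat."
counts = count_words(sample)
for word, count in top_words(counts):
    print(f"{word}: {count}")
```

Sorting explicitly by count and then by word makes the order of tied words deterministic, which is convenient when comparing results between runs.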
**Using the nltk library**:
```python
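import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


def word_freq_from_nltk(file_path):
    # Download the tokenizer models and the stop-word list on first use.
    # Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
    nltk.download('punkt', quiet=True)
    nltk.download('stopwords', quiet=True)
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    # Choose the stop-word language that matches your text.
    # word_tokenize does not segment Chinese, so Chinese text needs a
    # dedicated word segmenter before it can be counted this way.
    stop_words = set(stopwords.words('english'))
    # Keep alphanumeric tokens that are not stop words; lowercase them so
    # they match the lowercase stop-word list.
    words = [word.lower() for word in word_tokenize(text)
             if word.isalnum() and word.lower() not in stop_words]
    return nltk.FreqDist(words)


# Example
freq_distribution = word_freq_from_nltk('your_text_file.txt')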
```
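The filtering step in the nltk version can be illustrated without downloading any data. This is only a sketch under stated assumptions: `DEMO_STOP_WORDS` is a tiny hand-written set standing in for the NLTK stop-word list, the token list is hypothetical, and `Counter` stands in for `nltk.FreqDist` (which is itself a `Counter` subclass).

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word set; a real run would use
# the NLTK stop-word list instead.
DEMO_STOP_WORDS = {"the", "a", "is", "on", "and"}


def filter_tokens(tokens, stop_words):
    """Drop punctuation and stop words, and lowercase what remains."""
    return [
        token.lower()
        for token in tokens
        if token.isalnum() and token.lower() not in stop_words
    ]


# Hypothetical tokens, shaped like the output of a word tokenizer
tokens = ["The", "cat", "is", "on", "the", "mat", ",",
          "and", "the", "cat", "purrs", "."]
freq = Counter(filter_tokens(tokens, DEMO_STOP_WORDS))
print(freq.most_common(1))
```

Comparing the lowercased token against the stop-word set matters because capitalized forms such as "The" would otherwise pass through the filter.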