Word frequency in Python
Date: 2023-11-08 18:05:58
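The Counter class from Python's collections module can be used to count word frequencies. The steps are:
1. Read the contents of the text file
2. Convert the text to lowercase
3. Use regular expressions to remove punctuation and digits
4. Split the text on whitespace into a list of words
5. Count how many times each word occurs with Counter
Example code: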
```python
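import re
from collections import Counter

# Read the whole file; the encoding is an assumption, adjust it to match your file
with open('text.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Lowercase so that "The" and "the" count as the same word
text = text.lower()
# Drop punctuation (anything that is neither a word character nor whitespace)
text = re.sub(r'[^\w\s]', '', text)
# Drop digits
text = re.sub(r'\d+', '', text)
# Split on whitespace into a list of words
words = text.split()
word_counts = Counter(words)
print(word_counts.most_common(10))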
```
The code above prints the 10 most frequent words together with their counts.
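The same steps can also be packaged as a function that takes a string, which makes them easier to test without a file. The sketch below is a variation rather than the code above: the name `count_words` and the `[a-z]+` pattern are assumptions chosen for illustration. Extracting runs of letters with `re.findall` handles punctuation removal, digit removal, and splitting in one pass. It also drops underscores, which `\w` would keep, and it splits contractions such as "don't" into "don" and "t", so it is only suitable for plain ASCII English text.

```python
import re
from collections import Counter


def count_words(text, top_n=10):
    """Return the top_n most frequent words in text as (word, count) pairs."""
    # Lowercase first so "The" and "the" are counted together.
    lowered = text.lower()
    # Keep only runs of ASCII letters; punctuation, digits and underscores are dropped.
    words = re.findall(r"[a-z]+", lowered)
    return Counter(words).most_common(top_n)


sample = "The cat saw the dog. The dog saw 2 cats!"
print(count_words(sample, top_n=3))
```

Passing the contents of a file, for example `count_words(f.read())`, gives the same kind of result as the script above.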
Related questions
MapReduce word frequency in Python
Computing word frequencies with the MapReduce pattern is straightforward in Python. Here is a simple example:
```python
from collections import Counter
from multiprocessing import Pool


def mapper(text):
    # Map step: count the words in one piece of text
    words = text.split()
    return Counter(words)


def reducer(counters):
    # Reduce step: merge the per-text counters into one
    return sum(counters, Counter())


def map_reduce(data, num_processes):
    # The with block shuts the worker pool down once map() has returned
    with Pool(num_processes) as pool:
        mapped_data = pool.map(mapper, data)
    return reducer(mapped_data)


if __name__ == "__main__":
    # Some sample text data
    data = [
        "Hello world",
        "Hello CSDN",
        "CSDN is great",
        "Hello python"
    ]
    # Compute word frequencies with MapReduce
    word_count = map_reduce(data, 2)
    # Print the results
    for word, count in word_count.items():
        print(f"{word}: {count}")
```
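Running the code above prints the number of times each word occurs. This is a simple MapReduce implementation that you can modify and extend to suit your needs.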
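One possible extension concerns the reduce step. `sum(counters, Counter())` builds a new Counter for every addition, which becomes wasteful when there are many partial results. The sketch below merges them into a single accumulator with `Counter.update` instead. The name `merge_counts` is hypothetical; it is meant as a drop-in alternative to `reducer`, not as part of the original example.

```python
from collections import Counter


def merge_counts(counters):
    """Merge an iterable of Counter objects into one Counter (the reduce step)."""
    total = Counter()
    for partial in counters:
        # update() adds to the counts of existing keys instead of replacing them.
        total.update(partial)
    return total


# Partial results as a mapper would return them, one Counter per line of text.
lines = ["Hello world", "Hello CSDN", "CSDN is great"]
partials = [Counter(line.split()) for line in lines]
merged = merge_counts(partials)
print(merged.most_common(2))
```

Because the function accepts any iterable, it can also consume the list returned by `pool.map` directly.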
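Word frequency counting in Python
Word frequency counting means counting how often each word occurs in a piece of text. Python offers several ways to do this. Here are a few common ones:
1. Counting by hand with a plain dictionary: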
```python
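# all_words is assumed to be a list of words; this sample is a placeholder
all_words = "the quick fox and the lazy dog".split()
wordcount = {}
for word in all_words:
    # get() returns 0 for a word that has not been seen yet
    wordcount[word] = wordcount.get(word, 0) + 1
# Sort (word, count) pairs by count, highest first
sorted_wordcount = sorted(wordcount.items(), key=lambda x: x[1], reverse=True)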
```
2. Chinese word frequency with the third-party library jieba:
```python
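import jieba
from collections import Counter

# text and stop_words are placeholders; substitute your own text and stop-word set
text = "我爱自然语言处理，自然语言处理很有趣"
stop_words = {"我们", "一个"}
wordcount = Counter()
for word in jieba.cut(text):
    # Skip single-character tokens and stop words
    if len(word) > 1 and word not in stop_words:
        wordcount[word] += 1
sorted_wordcount = wordcount.most_common(10)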
```
3. English word frequency using only built-in methods:
```python
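# speech_text is a placeholder; substitute the text you want to analyze
speech_text = "The cat and the hat and the bat"
speech = speech_text.lower().split()
wordcount = {}
for word in speech:
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
# Keep the 10 most frequent words
sorted_wordcount = sorted(wordcount.items(), key=lambda x: x[1], reverse=True)[:10]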
```
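All three methods produce raw counts. Since word frequency is often reported as a share of the total rather than an absolute number, the sketch below converts counts into relative frequencies. `relative_frequencies` is a hypothetical helper written for this illustration, not part of any library, and it works with either a plain dict or a Counter.

```python
from collections import Counter


def relative_frequencies(counts):
    """Map each word to its share of all counted words."""
    total = sum(counts.values())
    if total == 0:
        # Nothing was counted, so there are no shares to report.
        return {}
    return {word: n / total for word, n in counts.items()}


counts = Counter("to be or not to be".split())
freqs = relative_frequencies(counts)
# Print the shares as percentages, highest first.
for word, share in sorted(freqs.items(), key=lambda item: item[1], reverse=True):
    print(f"{word}: {share:.2%}")
```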