
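Python: given an English text, count the number of distinct words and find the top 10% of words by frequency. A "word" is a string of at most 15 word characters (a word longer than 15 characters is truncated to its first 15 word characters). "Word characters" are uppercase and lowercase letters, digits, and underscores; every other character is treated as a word separator.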

Time: 2024-10-15 21:13:05 Views: 90

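In Python, you can use the `Counter` class from the `collections` module to count word frequencies, and its `most_common` method to pick out the top 10% of words by frequency. Here is a simple example:

```python
import re
from collections import Counter


def count_words(text):
    # Word characters are ASCII letters, digits, and underscores;
    # every other character acts as a separator
    raw_words = re.findall(r'[A-Za-z0-9_]+', text)
    # Words longer than 15 characters keep only their first 15 characters
    words = [word[:15] for word in raw_words]

    # Use Counter to count how often each word occurs
    word_counts = Counter(words)

    # Top 10% of the distinct words, rounded down
    # (the problem statement does not say how to round)
    top_n = len(word_counts) // 10
    top_10_percent = [word for word, freq in word_counts.most_common(top_n)]

    return word_counts, top_10_percent


# Test text
text = "This is a sample text with some long words like 'programming', 'Python', and _12345_. It should demonstrate how to find the most frequent words."
word_counts, top_freq_words = count_words(text)
print("Distinct words:", len(word_counts))
print("Word counts:", word_counts)
print("Top 10% frequent words:", top_freq_words)
```

In this example, we first extract the words from the text with a regular expression, truncate any word longer than 15 characters, and use `Counter` to count how many times each word occurs. The number of distinct words is `len(word_counts)`. We then take 10% of that number and use `most_common` to return that many words in descending order of frequency. Note that a pattern such as `\b\w{1,15}\b` would not work here: it skips words longer than 15 characters instead of truncating them, and in Python 3 `\w` also matches non-ASCII letters.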
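The word rules in the problem statement (separator characters, the 15-character cap, and the top-10% cutoff) are easy to get subtly wrong, so here is a minimal sketch that isolates each rule. The names `tokenize`, `top_percent`, and `MAX_LEN` are hypothetical helpers introduced for illustration. Two behaviors are assumptions, since the problem does not specify them: the cutoff is rounded down, and ties are ordered alphabetically.

```python
import re
from collections import Counter

MAX_LEN = 15  # length cap taken from the problem statement


def tokenize(text):
    """Split on anything that is not an ASCII letter, digit, or underscore,
    then keep at most the first MAX_LEN characters of each word."""
    return [w[:MAX_LEN] for w in re.findall(r"[A-Za-z0-9_]+", text)]


def top_percent(text, percent=10):
    """Return (number of distinct words, top `percent`% of distinct words).

    Assumptions (not stated in the problem): the cutoff is rounded down,
    and ties are ordered by frequency descending, then alphabetically.
    """
    counts = Counter(tokenize(text))
    cutoff = len(counts) * percent // 100  # integer arithmetic, rounds down
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return len(counts), ranked[:cutoff]


# Two long identifiers collapse to the same 15-character prefix,
# and "-" separates "x" from "y".
sample = "a_very_long_identifier_name a_very_long_identifier_tail x-y x"
distinct, top = top_percent(sample, percent=50)
print(distinct, top)  # 3 [('a_very_long_ide', 2)]
```

The sample uses `percent=50` only so that a short text produces a non-empty result; with the default of 10, any text with fewer than 10 distinct words yields an empty list under the round-down assumption.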
