I downloaded a product Excel file from Amazon, and each record is stored like this: {"reviewerID": "AO1Z63O9CTT95", "asin": "B000H29TXU", "reviewerName": "James Westervelt", "helpful": [0, 0], "reviewText": "he's OK. His humor consists mainly of varying between a Vietnamese accent and a smooth articulate presentation of ironic situations.", "overall": 2.0, "summary": "same routine he did on last comic standing", "unixReviewTime": 1394064000, "reviewTime": "03 6, 2014"}. How can I count word frequencies in the reviews?
Posted: 2023-08-09 19:10:54
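You can count word frequencies in the reviews with the following steps:
1. Extract the review text: take the value of the "reviewText" field from each record; this is the body of the review.
2. Clean the text: remove punctuation, digits, and other irrelevant characters, keeping only letters and whitespace.
3. Convert to lowercase: lowercase all text so that the counts are not affected by capitalization.
4. Split the text into words: split each review on whitespace to get individual words.
5. Count the words: go through each review's word list and count how many times each word appears.
Below is a Python example that implements these steps: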
```python
import re
from collections import Counter


def count_word_frequency(reviews):
    """Count how often each word appears across all review texts."""
    word_freq = Counter()
    for review in reviews:
        review_text = review["reviewText"]
        # Clean the text: keep only letters and whitespace
        review_text = re.sub(r"[^a-zA-Z\s]", "", review_text)
        review_text = review_text.lower()
        # Split the text into words
        words = review_text.split()
        # Update the running word counts
        word_freq.update(words)
    return word_freq


# Sample data
reviews = [
    {"reviewerID": "AO1Z63O9CTT95", "asin": "B000H29TXU", "reviewerName": "James Westervelt", "helpful": [0, 0], "reviewText": "he's OK. His humor consists mainly of varying between a Vietnamese accent and a smooth articulate presentation of ironic situations.", "overall": 2.0, "summary": "same routine he did on last comic standing", "unixReviewTime": 1394064000, "reviewTime": "03 6, 2014"},
    # Add more review data...
]

# Count word frequencies
word_frequency = count_word_frequency(reviews)

# Print the 10 most common words
most_common_words = word_frequency.most_common(10)
for word, freq in most_common_words:
    print(f"{word}: {freq}")
```
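One detail worth knowing: stripping every non-letter character turns "he's" into "hes". If that matters for your analysis, a possible variant is to extract tokens with `re.findall` and keep apostrophes that sit inside a word. This is only a sketch; the helper name `tokenize` and the pattern are assumptions, not part of the answer above.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by single apostrophes (e.g. "he's").
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(text):
    """Lowercase the text and return word tokens, keeping in-word apostrophes."""
    return TOKEN_PATTERN.findall(text.lower())


# Small demonstration on part of the sample review
sample = "he's OK. His humor consists mainly of varying between a Vietnamese accent."
tokens = tokenize(sample)
print(tokens)
print(Counter(tokens).most_common(3))
```

To use it, you could replace the `re.sub` / `lower()` / `split()` lines inside `count_word_frequency` with `words = tokenize(review_text)`.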
Note that the code above is only an example; you may need to adjust it to match your actual data format and requirements.
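For instance, the record in the question is a JSON object rather than a spreadsheet row, so the file may well contain one JSON object per line. Under that assumption (please check your file first), a minimal sketch for loading the records could look like this. The function name `parse_reviews` and the file name `reviews.json` are hypothetical.

```python
import json


def parse_reviews(lines):
    """Parse an iterable of JSON lines into a list of review dicts."""
    reviews = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            reviews.append(json.loads(line))
    return reviews


# Small in-memory demonstration (the lines below are made-up sample input)
demo_lines = ['{"reviewText": "he is OK."}', "", '{"reviewText": "same routine"}']
demo_reviews = parse_reviews(demo_lines)
print(len(demo_reviews))

# Hypothetical usage with a real file (the file name is an assumption):
# with open("reviews.json", encoding="utf-8") as f:
#     reviews = parse_reviews(f)
# word_frequency = count_word_frequency(reviews)
```

Because an open file object yields its lines one at a time, it can be passed to `parse_reviews` directly, and the resulting list can be handed to `count_word_frequency`.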