Filtering invalid comments, tokenizing, removing stop words, and generating a word cloud — complete Python code
Date: 2024-10-11 19:03:31
Processing text data in Python — filtering invalid comments, tokenizing, removing stop words, and generating a word cloud — usually involves the following steps. Here is a simple example:
```python
# First install the required libraries: nltk (natural language toolkit),
# jieba (Chinese word segmentation), wordcloud, pandas and matplotlib.
# The "!pip" form works in a notebook; in a shell, drop the leading "!".
!pip install nltk jieba wordcloud pandas matplotlib

import re                          # regular expressions, used to filter invalid comments
from collections import Counter    # word-frequency counting

import jieba                       # Chinese word segmentation
import matplotlib.pyplot as plt    # plotting
import nltk
from nltk.corpus import stopwords  # stop-word lists
from wordcloud import WordCloud    # word-cloud generation
import pandas as pd                # only needed if you load the comments from a CSV file

nltk.download('stopwords')         # the NLTK stop-word lists must be downloaded once
# 1. Load the data (e.g. from a Python list or the text column of a CSV file)
comments = [...]  # a list of comment strings, or df['text'].tolist() from a DataFrame
# 2. Filter invalid comments
def filter_invalid_comments(comments):
    pattern = r'[^\u4e00-\u9fa5a-zA-Z0-9\s]'  # anything that is not a Chinese character, letter, digit or whitespace
    cleaned = [re.sub(pattern, '', comment).strip() for comment in comments]
    return [c for c in cleaned if c]  # discard comments that are empty after cleaning

filtered_comments = filter_invalid_comments(comments)
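# A quick, standalone sanity check of the cleaning regex on a hypothetical
# sample comment (emoji and punctuation are stripped, Chinese text is kept):
import re  # already imported above; repeated so this snippet runs on its own
demo_cleaned = re.sub(r'[^\u4e00-\u9fa5a-zA-Z0-9\s]', '', '点赞👍!!!').strip()
print(demo_cleaned)  # 点赞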
# 3. Tokenize and remove stop words
stop_words = set(stopwords.words('chinese'))  # NLTK's Chinese stop-word list

def tokenize_and_remove_stop_words(comment):
    words = jieba.lcut(comment)  # segment the comment into a list of words
    filtered_words = [word for word in words if word not in stop_words and word.strip()]
    return ' '.join(filtered_words)

tokenized_comments = [tokenize_and_remove_stop_words(comment) for comment in filtered_comments]
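# The stop-word removal above is a plain set-membership test; a standalone
# illustration with hypothetical words and stop words:
demo_stop_words = {'的', '了', '很'}
demo_words = ['电影', '的', '画面', '很', '棒']
demo_kept = [w for w in demo_words if w not in demo_stop_words]
print(demo_kept)  # ['电影', '画面', '棒']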
# 4. Count word frequencies
counter = Counter(word for comment in tokenized_comments for word in comment.split())
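# Counter.most_common returns (word, count) pairs sorted by frequency and is a
# handy way to inspect the pipeline before plotting, e.g. counter.most_common(10).
# Standalone illustration on a tiny hypothetical sample:
from collections import Counter  # already imported above; repeated so this snippet runs on its own
demo_counter = Counter('好看 好看 一般'.split())
print(demo_counter.most_common(2))  # [('好看', 2), ('一般', 1)]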
# 5. Generate the word cloud
# Note: WordCloud's default font cannot render Chinese characters; point font_path
# at a font with CJK glyphs (e.g. 'simhei.ttf' on Windows), or the words show as boxes.
wordcloud = WordCloud(font_path='simhei.ttf', width=800, height=600,
                      background_color='white',
                      min_font_size=10).generate_from_frequencies(counter)
plt.figure(figsize=(10, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Comment keywords")
plt.show()
# wordcloud.to_file('comment_wordcloud.png')  # optionally also save the image to disk
```