Write code to analyze and visualize Douban movie short-review data
Time: 2024-11-03 13:23:02
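Analyzing Douban short movie reviews usually involves several steps: data cleaning, text preprocessing, sentiment analysis, and visualization. Below is a simple Python example that loads CSV data with Pandas, filters stopwords with NLTK, scores sentiment with TextBlob, and displays the results with Matplotlib (Seaborn would work as well).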
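One detail of the preprocessing step is worth seeing in isolation first. Chinese text is written without spaces between words, so splitting on whitespace leaves a whole sentence as a single token, and stopword filtering then has nothing to match. The sketch below uses only the standard library and falls back on overlapping character bigrams as a crude stand-in for word segmentation. The sample reviews, the stop-character set, and the helper names `char_bigrams` and `count_bigrams` are all hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical sample reviews and stop characters, for illustration only.
SAMPLE_REVIEWS = ["这部电影真的很好看", "剧情很无聊", "这部电影剧情很好"]
STOP_CHARS = set("的很了")


def char_bigrams(text):
    """Return overlapping two-character tokens from the non-space characters of text."""
    chars = [ch for ch in text if not ch.isspace()]
    return [a + b for a, b in zip(chars, chars[1:])]


def count_bigrams(reviews, stop_chars):
    """Count bigrams across reviews, skipping any bigram that contains a stop character."""
    counts = Counter()
    for review in reviews:
        for token in char_bigrams(review):
            if not (set(token) & stop_chars):
                counts[token] += 1
    return counts


# Whitespace splitting leaves an unsegmented Chinese sentence as a single token.
print(SAMPLE_REVIEWS[0].split())
# Bigram counts give a rough picture of frequent word-like units.
print(count_bigrams(SAMPLE_REVIEWS, STOP_CHARS).most_common(3))
```

Bigrams also produce fragments that are not real words, so in practice a dedicated Chinese word segmenter (jieba is a common choice) would likely replace `char_bigrams`. The full example script follows.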
```python
# Import the required libraries
import matplotlib.pyplot as plt
import nltk
import pandas as pd
from nltk.corpus import stopwords
from textblob import TextBlob
from wordcloud import WordCloud

# Download the NLTK stopword lists (only needed once per environment)
nltk.download('stopwords')
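# Load the data (the file is expected to contain a 'review_text' column)
data = pd.read_csv('douban_movie_reviews.csv')
# Data cleaning
data.dropna(subset=['review_text'], inplace=True)  # drop rows with no review text
data['review_text'] = data['review_text'].str.lower()  # lowercase any Latin letters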
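# Text preprocessing: remove stopwords
# Note: this assumes the installed NLTK stopwords corpus includes a 'chinese'
# list. split() only separates on whitespace, so unsegmented Chinese text
# stays one token per review unless it is segmented into words first.
stop_words = set(stopwords.words('chinese'))
data['cleaned_review'] = data['review_text'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop_words)
)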
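# Sentiment analysis
# Note: TextBlob's default analyzer is built for English text, so scores for
# Chinese reviews will mostly be 0.
def sentiment_analysis(text):
    """Return the TextBlob polarity of text, a float in [-1.0, 1.0]."""
    return TextBlob(text).sentiment.polarity
data['sentiment_score'] = data['cleaned_review'].apply(sentiment_analysis)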
# Visualization
# Distribution of sentiment scores
plt.figure(figsize=(10, 6))
plt.hist(data['sentiment_score'], bins=20)
plt.xlabel('Sentiment Score')
plt.ylabel('Frequency')
plt.title('Distribution of Movie Review Sentiments')
plt.show()
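# Word cloud of frequent terms
# Note: WordCloud's default font has no Chinese glyphs; pass font_path pointing
# to a CJK-capable font file to render Chinese characters.
wordcloud = WordCloud(
    width=800, height=400, background_color='white', stopwords=stop_words
).generate(' '.join(data['cleaned_review']))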
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Top Keywords in Movie Reviews')
plt.show()
```