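Select The Shawshank Redemption on Douban and collect its review data; analyze the data with visualizations: a review-count trend chart, a review-time distribution chart, a rating distribution chart, a word cloud of high-rated reviews, a word cloud of low-rated reviews, and so on; explain the overall approach fully in Markdown.
Time: 2023-12-25 12:05:50 Views: 290
First, install the required libraries: requests, BeautifulSoup, pandas, matplotlib, and wordcloud. You can install them with the following command: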
```shell
pip install requests beautifulsoup4 pandas matplotlib wordcloud
```
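Note that some import names differ from the pip package names (beautifulsoup4 is imported as bs4). As a quick sanity check after installing, a minimal sketch like the following lists any module that Python cannot find. The helper name `missing_modules` is hypothetical and only used for illustration.

```python
from importlib.util import find_spec

# Import names for the packages installed above (beautifulsoup4 -> bs4)
REQUIRED_MODULES = ['requests', 'bs4', 'pandas', 'matplotlib', 'wordcloud']


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == '__main__':
    missing = missing_modules(REQUIRED_MODULES)
    print('Missing modules:', missing or 'none')
```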
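Then you can use the following code to scrape the reviews of The Shawshank Redemption. It collects the rating, the timestamp, and the text of each review, because the analysis step needs all three:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Send the request and fetch the page content
url = 'https://movie.douban.com/subject/1292052/reviews'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error (for example a blocked request)
content = response.text

# Parse the page content with BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')
reviews = soup.find_all('div', class_='review-item')

# Extract the review data.
# NOTE: the class names below are assumptions about Douban's page markup and may
# need adjusting; inspect the page if a column comes back empty.
data = []
for review in reviews:
    rating_tag = review.find('span', class_='main-title-rating')
    date_tag = review.find('span', class_='main-meta')
    comment_tag = review.find('div', class_='short-content')
    data.append({
        'rating': rating_tag.get('title', '') if rating_tag else '',  # text label such as 力荐
        'date': date_tag.text.strip() if date_tag else '',
        'comment': comment_tag.text.strip() if comment_tag else '',
    })

# Convert the data to a DataFrame
df = pd.DataFrame(data)

# Save the data as a CSV file
df.to_csv('reviews.csv', index=False)
```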
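The request above fetches only a single page of reviews, which is a small sample for trend and distribution charts. Below is a minimal sketch of how the URLs for several pages might be built. It assumes the site paginates with a `start` query parameter and 20 reviews per page; both are assumptions to verify against the live site, and `page_urls` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/subject/1292052/reviews'


def page_urls(base_url, pages, page_size=20):
    """Build one URL per result page.

    Assumption: the site paginates with a `start` offset and `page_size`
    reviews per page.
    """
    return [f'{base_url}?{urlencode({"start": page * page_size})}' for page in range(pages)]


if __name__ == '__main__':
    for page_url in page_urls(BASE_URL, pages=3):
        print(page_url)
```

Each URL could then be fetched with the same `requests.get` call and headers, ideally with a short `time.sleep` between requests to avoid hammering the server.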
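Next, you can use the following code to visualize and analyze the data:
```python
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Assumption: path to a font with Chinese glyphs; adjust it for your system.
# Without such a font the word clouds render Chinese characters as empty boxes.
FONT_PATH = 'simhei.ttf'

# Load the review data (expects the columns rating, date and comment)
df = pd.read_csv('reviews.csv')
df['date'] = pd.to_datetime(df['date'])
df['comment'] = df['comment'].fillna('')

# Convert Douban's rating labels to a 10-point score so they can be compared numerically
label_to_score = {'力荐': 10, '推荐': 8, '还行': 6, '较差': 4, '很差': 2}
df['rating'] = df['rating'].map(label_to_score)

# Plot the trend of the number of reviews per day
df['date'].dt.floor('D').value_counts().sort_index().plot(kind='line')
plt.xlabel('Date')
plt.ylabel('Number of Comments')
plt.title('Trend of Comment Quantity')
plt.show()

# Plot the distribution of reviews over the hours of the day
df['hour'] = df['date'].dt.hour
df['hour'].value_counts().sort_index().plot(kind='bar')
plt.xlabel('Hour')
plt.ylabel('Number of Comments')
plt.title('Distribution of Comment Time')
plt.show()

# Plot the rating distribution
df['rating'].value_counts().sort_index().plot(kind='bar')
plt.xlabel('Rating')
plt.ylabel('Number of Comments')
plt.title('Distribution of Ratings')
plt.show()


def show_wordcloud(comments, title):
    """Draw a word cloud for the given comments, skipping empty input."""
    text = ' '.join(comments)
    if not text.strip():  # WordCloud raises an error on empty text
        print(f'No comments available for: {title}')
        return
    wordcloud = WordCloud(font_path=FONT_PATH, background_color='white').generate(text)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title)
    plt.show()


# Word cloud of high-rated reviews (score of 8 or above)
show_wordcloud(df[df['rating'] >= 8]['comment'], 'Word Cloud of High Rating Comments')

# Word cloud of low-rated reviews (score of 3 or below)
show_wordcloud(df[df['rating'] <= 3]['comment'], 'Word Cloud of Low Rating Comments')
```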
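One caveat for the word clouds: Chinese text is not space-delimited, so passing raw comments to `WordCloud.generate` tends to treat whole clauses as single "words". A proper fix is to segment the text first with a Chinese tokenizer such as jieba. As a dependency-free stand-in (not real segmentation), the sketch below counts two-character sequences instead. The function name `bigram_frequencies` and the `stopwords` parameter are hypothetical.

```python
import re
from collections import Counter


def bigram_frequencies(texts, stopwords=frozenset()):
    """Count overlapping two-character sequences in runs of Chinese characters.

    This is a crude substitute for word segmentation: it ignores word
    boundaries and simply slides a window of two characters over each run.
    """
    counts = Counter()
    for text in texts:
        # Punctuation, digits and Latin text act as separators between runs
        for run in re.findall(r'[\u4e00-\u9fff]+', text):
            for i in range(len(run) - 1):
                gram = run[i:i + 2]
                if gram not in stopwords:
                    counts[gram] += 1
    return dict(counts)


if __name__ == '__main__':
    sample = ['希望是好事', '希望让人自由']
    print(bigram_frequencies(sample, stopwords={'望是'}))
```

The resulting dictionary can be passed to `WordCloud(...).generate_from_frequencies(freqs)` in place of `generate(text)`.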
That is the overall approach. You can save the code as a .py file and run it to collect the data and produce the visualizations. Hope this helps!