Scraping the Douban Movie Top 250 with Python, visualizing the data, and saving it to CSV
Posted: 2023-08-05 07:52:27
Sure, here is some reference code. First, install the required libraries: requests, BeautifulSoup (the `beautifulsoup4` package), pandas, and matplotlib.
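Before running the script, it can help to confirm that these libraries are importable. The following is a minimal sketch using only the standard library; the helper name `missing_modules` and the list `REQUIRED_MODULES` are hypothetical and not part of the original answer.

```python
from importlib.util import find_spec

# Import names, not pip names: BeautifulSoup is installed as
# "beautifulsoup4" but imported as "bs4".
REQUIRED_MODULES = ["requests", "bs4", "pandas", "matplotlib"]


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

If anything is reported as missing, install it with pip and run the check again before moving on to the scraper.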
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt


# Fetch the movie entries from a single page
def get_movie_info(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # stop on an HTTP error instead of parsing an error page
    soup = BeautifulSoup(response.text, 'html.parser')
    movie_list = soup.find_all('div', class_='hd')
    info_list = []
    for movie in movie_list:
        title = movie.a.span.text.strip()  # movie title
        link = movie.a['href']  # movie link
        rating = movie.parent.find('span', class_='rating_num').text  # rating
        # Number of ratings: take the last span and drop its three-character suffix
        rating_num = movie.parent.find('div', class_='star').find_all('span')[-1].text[:-3]
        info_list.append([title, link, float(rating), int(rating_num)])
    return info_list


# Fetch the entries from all ten pages (25 movies per page)
def get_all_movie_info():
    url = 'https://movie.douban.com/top250'
    info_list = []
    for i in range(10):
        page_url = url + f'?start={i*25}'
        info_list += get_movie_info(page_url)
    return info_list


# Plot the distribution of movie ratings as a histogram
def plot_rating_distribution(data):
    plt.hist(data, bins=10, range=(0, 10), color='b', alpha=0.5)
    plt.xlabel('Rating')
    plt.ylabel('Number of Movies')
    plt.title('Distribution of Movie Ratings')
    plt.savefig('rating_distribution.png')
    plt.show()


# Save the movie information to a CSV file
def save_to_csv(data):
    df = pd.DataFrame(data, columns=['Title', 'Link', 'Rating', 'RatingNum'])
    df.to_csv('movie_info.csv', index=False)


if __name__ == '__main__':
    movie_info = get_all_movie_info()
    save_to_csv(movie_info)
    ratings = [info[2] for info in movie_info]
    plot_rating_distribution(ratings)
```
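In the code above, `get_movie_info(url)` fetches the movie information from a single page, `get_all_movie_info()` collects the information from all ten pages, `save_to_csv(data)` writes the data to a CSV file, and `plot_rating_distribution(data)` draws a histogram of the rating distribution. The main block simply calls these functions in turn.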
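One fragile spot in `get_movie_info(url)` is the number of ratings, which the script obtains by slicing the last three characters off the text of the final `span`. A slightly more tolerant alternative is to pull out the digits with a regular expression. The sketch below assumes the text looks like `123456人评价`, which is what the three-character slice implies; the helper name `parse_vote_count` is hypothetical and not part of the original answer.

```python
import re


def parse_vote_count(text):
    """Extract the first run of digits from text such as '123456人评价'."""
    match = re.search(r"\d+", text)
    if match is None:
        # Raise instead of returning 0 so a layout change is noticed.
        raise ValueError(f"no digits found in {text!r}")
    return int(match.group())


print(parse_vote_count("123456人评价"))
```

Inside the loop, `int(rating_num)` could then be replaced by `parse_vote_count(...)` applied to the unsliced span text, so that a change in the suffix length would not silently corrupt the count.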