Scraping the Douban Movie Top 250 with Python and visualizing the results: code implementation
Date: 2023-08-05 10:52:34
The following is a Python implementation that scrapes the Douban Movie Top 250 and visualizes the results.
First, import the required libraries:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt
```
Next, write a function that scrapes the information for the Top 250 movies:
```python
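def get_movies():
    """Scrape title, director, rating and one-line quote for each Top 250 movie."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    movie_list = []
    # The list is paginated, 25 movies per page: start=0, 25, ..., 225.
    for i in range(0, 250, 25):
        url = 'https://movie.douban.com/top250?start={}&filter='.format(i)
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # stop early if the request is rejected
        soup = BeautifulSoup(response.text, 'html.parser')
        movies = soup.find_all('div', class_='info')
        for movie in movies:
            title = movie.find('span', class_='title').text
            # The first line of the info paragraph looks like '导演: A   主演: B'.
            info_line = movie.find('div', class_='bd').p.text.strip().split('\n')[0]
            # Keep the director only: drop the cast ('主演') part and the '导演:' label.
            director = info_line.split('主演')[0].split(':', 1)[-1].strip()
            # The rating is the first line of text in the 'star' block (kept as a string).
            star = movie.find('div', class_='star').text.strip().split('\n')[0]
            quote_tag = movie.find('span', class_='inq')
            quote = quote_tag.text if quote_tag else ''
            movie_list.append({'title': title, 'director': director, 'star': star, 'quote': quote})
    return movie_list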
```
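The function uses the requests library to fetch each page of the Douban Movie Top 250 and BeautifulSoup to parse the HTML. It finds every `div` tag with class `info`, one per movie, loops over them, and extracts each movie's title, director, rating, and one-line quote (the quote is empty when a movie has none). Finally, it collects all the movies in a list of dictionaries and returns that list.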
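The director extraction depends on the exact layout of the info line, so it may help to isolate it in a small helper that can be checked without touching the network. The following is a minimal sketch, assuming the line has the form `导演: … 主演: …` and that the cast label may be cut off on long lines; `parse_director` is a hypothetical name, not part of the original code.

```python
import re

def parse_director(info_line):
    """Extract the director part from a line such as '导演: A   主演: B'."""
    # Assumption: the cast label starts with '主' after some whitespace
    # (it may be truncated on long lines), so cut the line there.
    head = re.split(r'\s+主', info_line.strip(), maxsplit=1)[0]
    # Remove the leading '导演:' label, accepting an ASCII or full-width colon.
    return re.sub(r'^导演\s*[::]\s*', '', head).strip()

# Sample line written by hand for illustration.
sample = '导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0主演: 蒂姆·罗宾斯 Tim Robbins /...'
print(parse_director(sample))  # 弗兰克·德拉邦特 Frank Darabont
```

A helper like this would still misread a director whose name itself begins with `主`, so it is a starting point rather than a complete parser.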
Next, call the function and store the movie information in a DataFrame:
```python
movies = get_movies()
df = pd.DataFrame(movies)
```
Now pandas can be used to analyze and visualize the data. For example, we can count the number of movies per director and draw a horizontal bar chart with matplotlib:
```python
director_count = df['director'].value_counts().sort_values(ascending=True)
plt.barh(director_count.index, director_count.values)
plt.title('Number of Movies by Director')
plt.xlabel('Number of Movies')
plt.show()
```
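The chart above draws one bar per distinct director, which can get crowded. One option is to plot only the most frequent directors. The sketch below shows the idea with the standard library; `top_directors` and the sample names are made up for illustration.

```python
from collections import Counter

def top_directors(directors, n=10):
    """Return the n most frequent directors as (name, count) pairs, smallest first."""
    most_common = Counter(directors).most_common(n)
    # Reverse so the largest count is drawn last, i.e. at the top of a barh chart.
    return most_common[::-1]

names = ['A', 'B', 'A', 'C', 'A', 'B']  # made-up sample data
print(top_directors(names, n=2))  # [('B', 2), ('A', 3)]
```

With the DataFrame from this article, `director_count.tail(10)` should give an equivalent selection (up to the ordering of ties), since `director_count` is already sorted in ascending order.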
We can also count the number of movies at each rating and visualize the distribution as a pie chart:
```python
star_count = df['star'].value_counts()
plt.pie(star_count.values, labels=star_count.index, autopct='%1.1f%%')
plt.title('Distribution of Movie Ratings')
plt.show()
```
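Because each distinct rating becomes its own slice, the pie can end up with many thin wedges. Grouping ratings into bands is one way to make it easier to read. This is a sketch under the assumption that `star` holds strings such as `'9.3'`; the helper name `rating_band` and the band width of 0.5 are illustrative choices.

```python
import math
from collections import Counter

def rating_band(star, width=0.5):
    """Map a rating string such as '9.3' to a half-open band label such as '9.0-9.5'."""
    value = float(star)
    low = math.floor(value / width) * width
    return '{:.1f}-{:.1f}'.format(low, low + width)

ratings = ['9.7', '9.6', '9.3', '8.9']  # made-up sample data
band_count = Counter(rating_band(r) for r in ratings)
print(band_count['9.5-10.0'])  # 2
```

On the DataFrame, `df['star'].map(rating_band).value_counts()` would produce the banded counts, which can then be passed to `plt.pie` in the same way as `star_count`.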
Finally, we can save all the movie information to a CSV file:
```python
df.to_csv('movies.csv', index=False)
```
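A CSV with Chinese titles saved as plain UTF-8 may appear garbled when opened in Excel. Writing a byte-order mark with the `utf-8-sig` encoding usually avoids this. The sketch below shows the effect using only the standard library; the sample row and the temporary file path are for illustration only.

```python
import csv
import os
import tempfile

rows = [{'title': '肖申克的救赎', 'star': '9.7'}]  # sample row for illustration

path = os.path.join(tempfile.mkdtemp(), 'movies.csv')
# 'utf-8-sig' prepends a byte-order mark, which helps Excel detect UTF-8.
with open(path, 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'star'])
    writer.writeheader()
    writer.writerows(rows)

# Reading with the same encoding strips the mark again.
with open(path, newline='', encoding='utf-8-sig') as f:
    loaded = [dict(row) for row in csv.DictReader(f)]
print(loaded == rows)  # True
```

The pandas equivalent is `df.to_csv('movies.csv', index=False, encoding='utf-8-sig')`.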
The complete code is as follows:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt


def get_movies():
    """Scrape title, director, rating and one-line quote for each Top 250 movie."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    movie_list = []
    # The list is paginated, 25 movies per page: start=0, 25, ..., 225.
    for i in range(0, 250, 25):
        url = 'https://movie.douban.com/top250?start={}&filter='.format(i)
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # stop early if the request is rejected
        soup = BeautifulSoup(response.text, 'html.parser')
        movies = soup.find_all('div', class_='info')
        for movie in movies:
            title = movie.find('span', class_='title').text
            # The first line of the info paragraph looks like '导演: A   主演: B'.
            info_line = movie.find('div', class_='bd').p.text.strip().split('\n')[0]
            # Keep the director only: drop the cast ('主演') part and the '导演:' label.
            director = info_line.split('主演')[0].split(':', 1)[-1].strip()
            # The rating is the first line of text in the 'star' block (kept as a string).
            star = movie.find('div', class_='star').text.strip().split('\n')[0]
            quote_tag = movie.find('span', class_='inq')
            quote = quote_tag.text if quote_tag else ''
            movie_list.append({'title': title, 'director': director, 'star': star, 'quote': quote})
    return movie_list


# Scrape the data and load it into a DataFrame.
movies = get_movies()
df = pd.DataFrame(movies)

# Horizontal bar chart: number of movies per director.
director_count = df['director'].value_counts().sort_values(ascending=True)
plt.barh(director_count.index, director_count.values)
plt.title('Number of Movies by Director')
plt.xlabel('Number of Movies')
plt.show()

# Pie chart: share of movies at each rating.
star_count = df['star'].value_counts()
plt.pie(star_count.values, labels=star_count.index, autopct='%1.1f%%')
plt.title('Distribution of Movie Ratings')
plt.show()

# Save everything to a CSV file.
df.to_csv('movies.csv', index=False)
```