Scraping the Douban Movie Top250 with Jupyter Notebook
Date: 2023-10-10 17:14:43
Sure, here is code for scraping the Douban Movie Top250:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Scrape one page of the list and yield one tuple per movie
def get_movies(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail early on a blocked or failed request
    soup = BeautifulSoup(response.text, 'html.parser')
    movies_list = soup.find_all('div', class_='hd')
    for movie in movies_list:
        movie_name = movie.find('span', class_='title').string
        movie_url = movie.a['href']
        # Assumption about the page layout: the rating and the credits sit in
        # the <div class="bd"> that follows each <div class="hd">
        movie_body = movie.find_next_sibling('div', class_='bd')
        movie_rating = movie_body.find('span', class_='rating_num').string
        # First text line of the first <p>, assumed to look like
        # "导演: <director>   主演: <actor> / <actor> ..."
        movie_info = movie_body.find('p').get_text('\n', strip=True).split('\n')[0]
        movie_director, _, movie_actors = movie_info.partition('主演:')
        movie_director = movie_director.replace('导演:', '').strip()
        movie_actors = [actor.strip() for actor in movie_actors.split('/') if actor.strip()]
        yield movie_name, movie_rating, movie_director, movie_actors, movie_url

# Loop over all ten pages (25 movies per page)
movies = []
for i in range(0, 250, 25):
    url = 'https://movie.douban.com/top250?start={}&filter='.format(i)
    movies.extend(get_movies(url))

# Convert the results to a DataFrame
# Columns: title, rating, director, cast, link
movies_df = pd.DataFrame(movies, columns=['电影名称', '评分', '导演', '演员', '电影链接'])
print(movies_df)
```
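This code scrapes the title, rating, director, cast, and link of each movie in the Douban Movie Top250 and stores the results in a DataFrame. You can print the results with `print(movies_df)`.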
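The paging step can be checked without touching the network. The loop in the code above moves the `start` query parameter from 0 to 225 in steps of 25, which gives ten pages. Below is a minimal sketch of that URL generation. The helper `page_urls` and the constants `BASE_URL`, `PAGE_SIZE`, and `TOTAL` are hypothetical names introduced here for illustration, not part of the original code.

```python
# Hypothetical helper (not in the original code): build the URL of every page.
# The URL pattern and the step of 25 are taken from the loop in the code above.
BASE_URL = 'https://movie.douban.com/top250?start={}&filter='
PAGE_SIZE = 25   # movies per page, as implied by the step of the original loop
TOTAL = 250      # number of entries in the list


def page_urls(total=TOTAL, page_size=PAGE_SIZE):
    """Return the URL of every page, in ranking order."""
    return [BASE_URL.format(start) for start in range(0, total, page_size)]


urls = page_urls()
print(len(urls))   # 10
print(urls[0])     # https://movie.douban.com/top250?start=0&filter=
print(urls[-1])    # https://movie.douban.com/top250?start=225&filter=
```

Building the list first makes it easy to try the scraper on a single page, for example `list(get_movies(urls[0]))`, before requesting all ten.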