Python source code for scraping detailed information on the Douban Top 250 movies
Date: 2023-10-24 12:06:22
The following Python source code scrapes detailed information for each of the Douban Top 250 movies:
```python
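import time

import requests
from bs4 import BeautifulSoup

# Douban tends to reject requests without a browser-like User-Agent;
# the exact string below is an arbitrary choice.
HEADERS = {'User-Agent': 'Mozilla/5.0'}


def get_soup(url):
    """Fetch a page and return it parsed; raise on HTTP errors."""
    res = requests.get(url, headers=HEADERS, timeout=10)
    res.raise_for_status()
    return BeautifulSoup(res.text, 'html.parser')


def parse_info(info_text):
    """Turn the info block (one "label: value" pair per line) into a dict."""
    fields = {}
    for line in info_text.split('\n'):
        # Accept both the ASCII and the full-width colon
        label, sep, value = line.replace(':', ':').partition(':')
        if sep:
            fields[label.strip()] = value.strip()
    return fields


def split_names(value):
    """Split a "/"-separated value into a list of stripped names."""
    return [part.strip() for part in value.split('/') if part.strip()]


def get_movie_details(url):
    soup = get_soup(url)
    # Movie title
    name = soup.select_one('h1 span').text.strip()
    # Director (导演), actors (主演), genres (类型): looked up by label rather
    # than by line position, since other lines can sit between them
    info = parse_info(soup.select_one('div#info').text)
    director = info.get('导演', '')
    actors = split_names(info.get('主演', ''))
    genre = split_names(info.get('类型', ''))
    # Rating
    rating = soup.select_one('strong.rating_num').text.strip()
    # Number of ratings: keep only the digits, dropping the trailing "人评价"
    people = soup.select_one('.rating_people')
    rating_num = ''.join(ch for ch in people.text if ch.isdigit()) if people else ''
    # Summary: prefer the full (initially hidden) text, fall back to the short one
    summary_tag = soup.select_one('span.all.hidden') or soup.select_one('span[property="v:summary"]')
    summary = summary_tag.text.strip() if summary_tag else ''
    # Return the fields as a dict
    return {
        'name': name,
        'director': director,
        'actors': actors,
        'genre': genre,
        'rating': rating,
        'rating_num': rating_num,
        'summary': summary
    }


def get_top250_movies():
    movies = []
    for i in range(10):
        # Each list page shows 25 movies; `start` is the offset of the first one
        url = 'https://movie.douban.com/top250?start=' + str(i * 25) + '&filter='
        soup = get_soup(url)
        for item in soup.select('ol.grid_view li'):
            # Link to the movie's detail page
            link = item.select_one('div.hd a')['href']
            movies.append(get_movie_details(link))
            # Pause between requests to avoid hammering the site
            time.sleep(1)
    return movies


if __name__ == '__main__':
    top250_movies = get_top250_movies()
    for movie in top250_movies:
        print(movie)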
```
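The page loop in `get_top250_movies()` relies on the `start` query parameter: each list page holds 25 entries, so the ten pages begin at offsets 0, 25, and so on up to 225. A minimal sketch of just that URL construction, with no network access, might look like the following. The helper name `build_page_urls` and its parameters are hypothetical, introduced here only for illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/top250'


def build_page_urls(total=250, page_size=25):
    """Return the list-page URLs covering `total` entries, `page_size` per page.

    Hypothetical helper, not part of the original script.
    """
    urls = []
    for start in range(0, total, page_size):
        # Same query string the script builds by hand: start=<offset>&filter=
        query = urlencode({'start': start, 'filter': ''})
        urls.append(f'{BASE_URL}?{query}')
    return urls


if __name__ == '__main__':
    for url in build_page_urls():
        print(url)
```

Building the URLs separately from fetching them makes the offsets easy to check before any request is sent.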
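The script uses the `requests` and `BeautifulSoup` libraries to visit the Douban Top 250 list pages and fetch the details of every movie on them. `get_movie_details()` retrieves the details of a single movie from its detail page, and `get_top250_movies()` walks the ten list pages and collects the details of all movies in the Top 250. Each movie's information is stored as a dictionary, and the dictionaries are printed at the end.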
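The script only prints each dictionary. If the results need to be kept, one option (not part of the original code) is to write them out as CSV with the standard `csv` module. The sketch below assumes the dictionaries carry the keys returned by `get_movie_details()`. The function name `write_movies_csv` and the sample record are hypothetical, and list-valued fields are joined with ` / ` so that each movie fits on one row.

```python
import csv
import io

# Column order; assumed to match the keys returned by get_movie_details()
FIELDS = ['name', 'director', 'actors', 'genre', 'rating', 'rating_num', 'summary']


def write_movies_csv(movies, fileobj):
    """Write a list of movie dicts to an open text file object as CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    for movie in movies:
        row = dict(movie)
        # Flatten list-valued fields into a single cell
        for key in ('actors', 'genre'):
            if isinstance(row.get(key), list):
                row[key] = ' / '.join(row[key])
        writer.writerow(row)


if __name__ == '__main__':
    # Made-up sample record, only to show the expected shape
    sample = [{
        'name': 'Example Movie',
        'director': 'A. Director',
        'actors': ['Actor One', 'Actor Two'],
        'genre': ['Drama'],
        'rating': '9.0',
        'rating_num': '12345',
        'summary': 'A placeholder summary.',
    }]
    buffer = io.StringIO()
    write_movies_csv(sample, buffer)
    print(buffer.getvalue())
```

To write a real file, the same function can be given a file opened with `open('top250.csv', 'w', newline='', encoding='utf-8-sig')`. The `utf-8-sig` encoding adds a byte order mark, which helps some spreadsheet programs recognise Chinese text correctly.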