Don't simplify the steps; please give the detailed code
Posted: 2024-10-23 10:08:25
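Certainly. Below is a detailed Python example that covers every step, including error handling and paginated fetching (where needed). Douban may apply anti-scraping measures, so when you run the code, respect the site's robots.txt and add a suitable delay between requests to reduce the load on its servers.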
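```python
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}


def fetch_soup(url):
    """Download a page and parse it; return None if the request fails."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
    except requests.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        return None
    return BeautifulSoup(response.text, 'lxml')


# Pagination helper: each list page shows 25 movies
def fetch_page(start):
    return fetch_soup(f"https://movie.douban.com/top250?start={start}")


def join_text(soup, name, attrs, default=''):
    """Join the text of every matching tag with ' / ', or return default."""
    found = [tag.get_text(strip=True) for tag in soup.find_all(name, attrs=attrs)]
    return ' / '.join(found) if found else default


def parse_detail(title, soup):
    """Build one record from a movie's detail page.

    The list page does not carry these fields, so they are read from the
    detail page. The selectors are assumptions about Douban's markup and
    should be checked against the live HTML.
    """
    # The writer has no `property` attribute; it is assumed to follow a
    # <span class="pl">编剧</span> label inside the info block.
    label = soup.find('span', class_='pl', string='编剧')
    writer = label.find_next_sibling('span') if label else None
    return {
        'Title': title,
        'Director': join_text(soup, 'a', {'rel': 'v:directedBy'}),
        'Writer': writer.get_text(strip=True) if writer else '',
        'Stars': join_text(soup, 'a', {'rel': 'v:starring'}),
        'Genres': join_text(soup, 'span', {'property': 'v:genre'}),
        'Release date': join_text(soup, 'span', {'property': 'v:initialReleaseDate'}),
        'Runtime': join_text(soup, 'span', {'property': 'v:runtime'}),
        # `rating_num` holds the score itself; the vote count is separate
        'Rating count': join_text(soup, 'span', {'property': 'v:votes'}),
        'Synopsis': join_text(soup, 'span', {'property': 'v:summary'},
                              default='No synopsis available'),
    }


def main():
    data_list = []
    for start in range(0, 250, 25):  # 10 pages of 25 movies each
        page_soup = fetch_page(start)
        if page_soup is None:
            break
        # Each entry's <div class="hd"> wraps the link to its detail page
        for hd in page_soup.find_all('div', class_='hd'):
            link = hd.find('a', href=True)
            title_tag = hd.find('span', class_='title')
            if link is None or title_tag is None:
                continue
            time.sleep(1)  # pause before every detail request
            detail_soup = fetch_soup(link['href'])
            if detail_soup is not None:
                data_list.append(parse_detail(title_tag.get_text(strip=True), detail_soup))
        # Delay between list pages to avoid overly frequent requests
        time.sleep(1)

    # Save the data to a CSV file; utf-8-sig keeps Chinese text readable in Excel
    df_movies = pd.DataFrame(data_list)
    df_movies.to_csv('豆瓣电影Top250详情.csv', index=False, encoding='utf-8-sig')


if __name__ == '__main__':
    main()
```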
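The advice about robots.txt and request delays can be made concrete with the standard library. The following is a minimal sketch, not part of the original answer: `SAMPLE_ROBOTS_TXT`, the `my-crawler` user agent, the `example.com` URLs, and the helper names `build_parser` and `polite_delay` are all hypothetical and exist only for illustration. The rules are parsed from an inline string so the example runs offline; against a real site you would call `set_url()` and `read()` on the parser instead.

```python
import random
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, user_agent, base=1.0, jitter=0.5):
    """Seconds to wait: the larger of `base` and any Crawl-delay, plus random jitter."""
    crawl_delay = parser.crawl_delay(user_agent) or 0
    return max(base, crawl_delay) + random.uniform(0, jitter)


parser = build_parser(SAMPLE_ROBOTS_TXT)
allowed = parser.can_fetch("my-crawler", "https://example.com/top250?start=0")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/page")
delay = polite_delay(parser, "my-crawler")
print(allowed, blocked, round(delay, 2))
```

In a scraper, you would skip any URL for which `can_fetch` returns `False` and pass the computed delay to `time.sleep()` before each request. The random jitter keeps requests from arriving at a perfectly regular interval.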
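The answer mentions error handling; one common way to handle transient network failures is to retry with an increasing wait. The sketch below is an assumption-laden illustration rather than part of the original answer: `with_retries` and its parameters are hypothetical names, and it catches every `Exception` for simplicity, whereas a scraper would likely narrow that to `requests.RequestException`.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n seconds, then try again.

    The last failure is re-raised so the caller can decide what to do.
    `sleep` is injectable so the helper can be exercised without waiting.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:  # assumption: narrow this in real code
            if attempt == attempts - 1:
                raise
            wait = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {wait:.1f}s")
            sleep(wait)
```

A typical call wraps a single request in a lambda, for example `with_retries(lambda: requests.get(url, headers=headers, timeout=10))`, where `url` and `headers` come from the surrounding scraper.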