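Write some code: use Python to scrape the movie title, scriptwriter, lead actors, and genre of every film in the Douban Movie Top 250 and save them to a CSV file. Include countermeasures against anti-scraping, and mind the indentation.
Time: 2024-06-10 20:07:24 Views: 184
Sure, here is code that answers the question: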
```python
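import csv
import time
from random import randint

import requests
from lxml import etree

# Browser-like User-Agent; the default python-requests one is easy to block.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}


def get_html(url):
    """Return the page text, or None on a network error or non-200 status."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
        if response.status_code == 200:
            return response.text
    except requests.RequestException:
        pass
    return None


def parse_list(html):
    """Yield (title, detail_url) for every movie on one Top 250 list page."""
    for item in etree.HTML(html).xpath('//ol[@class="grid_view"]/li'):
        title = item.xpath('.//span[@class="title"]/text()')[0]
        detail_url = item.xpath('.//div[@class="hd"]/a/@href')[0]
        yield title, detail_url


def parse_detail(html):
    """Extract scriptwriters, actors and genres from a movie detail page.

    The list page shows only a shortened director/cast line and no
    scriptwriter, so these fields are read from the detail page instead.
    The XPath expressions assume Douban's markup and may need adjusting.
    """
    tree = etree.HTML(html)
    writers = tree.xpath('//span[@class="pl"][text()="编剧"]'
                         '/following-sibling::span[@class="attrs"]/a/text()')
    actors = tree.xpath('//a[@rel="v:starring"]/text()')
    genres = tree.xpath('//span[@property="v:genre"]/text()')
    return {
        'scriptwriter': '/'.join(writers),
        'actors': '/'.join(actors),
        'genre': '/'.join(genres),
    }


def save_csv(data):
    """Write the collected rows to a CSV file with a header row."""
    with open('豆瓣电影top250.csv', 'w', encoding='utf-8', newline='') as csvfile:
        fieldnames = ['title', 'scriptwriter', 'actors', 'genre']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for item in data:
            writer.writerow(item)


if __name__ == '__main__':
    urls = ['https://movie.douban.com/top250?start={}'.format(i) for i in range(0, 250, 25)]
    data = []
    for url in urls:
        html = get_html(url)
        if html is None:  # skip list pages that failed to download
            continue
        for title, detail_url in parse_list(html):
            time.sleep(randint(3, 5))  # random pause before every detail request
            detail_html = get_html(detail_url)
            if detail_html is None:  # skip movies whose page failed to download
                continue
            data.append({'title': title, **parse_detail(detail_html)})
    save_csv(data)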
```
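This is a simple crawler that scrapes the title, scriptwriter, lead actors, and genre of every film in the Douban Movie Top 250 and saves the results to a CSV file. As a countermeasure against anti-scraping, it sends a browser-like User-Agent header and waits a random interval between requests, which lowers the request rate (not the number of requests) and reduces the risk of being blocked.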
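If the fixed User-Agent and random pauses are not enough, two common additions are rotating the User-Agent and retrying failed requests with a growing delay. The following is only a minimal sketch of that idea, not part of the answer above: `USER_AGENTS`, `random_headers`, `backoff_delay`, and `fetch_with_retry` are hypothetical names, the User-Agent strings are illustrative, and `fetch` stands for any function that takes a URL and headers and returns the page text or `None`.

```python
import random
import time

# Hypothetical pool of browser User-Agent strings (illustrative values only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/15.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]


def random_headers():
    """Build request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def backoff_delay(attempt, base=3.0, cap=60.0):
    """Return base * 2**attempt seconds (capped), plus up to 1 s of jitter."""
    return min(cap, base * 2 ** attempt) + random.uniform(0, 1)


def fetch_with_retry(fetch, url, max_attempts=3, sleep=time.sleep):
    """Call fetch(url, headers) until it returns something other than None."""
    for attempt in range(max_attempts):
        result = fetch(url, random_headers())
        if result is not None:
            return result
        if attempt < max_attempts - 1:  # no need to wait after the last try
            sleep(backoff_delay(attempt))
    return None
```

With `requests`, `fetch` could be a small wrapper that calls `requests.get(url, headers=headers, timeout=10)` and returns `response.text` on status 200 and `None` otherwise. Passing `sleep` as a parameter keeps the retry logic easy to test without real waiting.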