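Help me write code that uses Scrapy and Selenium to scrape detailed information for the Maoyan Top 100
Time: 2023-06-03 14:01:35 Views: 140
Sure. Here is Python code that uses Scrapy and Selenium to scrape detailed information for the Maoyan Top 100: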
```python
import scrapy
from scrapy.http import TextResponse
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    start_urls = ['https://maoyan.com/board/4']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        # Current Selenium releases take `options=`; `chrome_options=` is the deprecated spelling.
        self.driver = webdriver.Chrome(options=chrome_options)

    def closed(self, reason):
        # Quit the browser when the spider finishes.
        self.driver.quit()

    def parse(self, response):
        # Render the list page in Chrome, then parse the rendered HTML.
        self.driver.get(response.url)
        response = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
        # Assumption: each movie sits in a <dd> that holds both the rank badge
        # and the "movie-item-info" block; check this against the live page.
        for movie in response.xpath('//dd[.//div[@class="movie-item-info"]]'):
            info = movie.xpath('.//div[@class="movie-item-info"]')
            item = {}
            item['rank'] = movie.xpath('.//*[contains(@class, "board-index")]/text()').get(default='').strip()
            item['title'] = info.xpath('p[@class="name"]/a/@title').get(default='').strip()
            item['star'] = info.xpath('p[@class="star"]/text()').get(default='').strip()
            item['time'] = info.xpath('p[@class="releasetime"]/text()').get(default='').strip()
            href = info.xpath('p[@class="name"]/a/@href').get()
            if href:
                yield scrapy.Request(url=response.urljoin(href),
                                     meta={'item': item},
                                     callback=self.parse_detail)
        # Follow the "next page" link if there is one.
        next_page_url = response.xpath('//a[@class="next"]/@href').get()
        if next_page_url:
            yield scrapy.Request(url=response.urljoin(next_page_url),
                                 callback=self.parse)

    def parse_detail(self, response):
        item = response.meta['item']
        # Load the detail page in the browser first; otherwise page_source
        # would still hold the list page.
        self.driver.get(response.url)
        selector = Selector(text=self.driver.page_source)
        item['type'] = selector.xpath('//div[@class="movie-brief-container"]/ul/li[1]/text()').extract()
        item['length'] = selector.xpath('//div[@class="movie-brief-container"]/ul/li[2]/text()').extract()
        yield item
```
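This code uses the Scrapy framework together with the Selenium library to scrape the Maoyan Top 100 movies, with a headless Chrome browser simulating the page visits. To run it, install the Scrapy and Selenium packages and make sure Chrome is available on the machine; the spider can then be started with a command such as `scrapy runspider`.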
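The spider builds its follow-up requests from `href` attributes read off the page. As a minimal sketch using only the standard library, `urllib.parse.urljoin` resolves such hrefs against the page URL, whether they are site-absolute paths or query-only links. The sample hrefs below (`/films/1203` and `?offset=10`) and the helper name `absolute_url` are illustrative assumptions, not values confirmed from the site.

```python
from urllib.parse import urljoin

BOARD_URL = 'https://maoyan.com/board/4'


def absolute_url(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# Hypothetical hrefs, used only to show the two common shapes.
detail_url = absolute_url(BOARD_URL, '/films/1203')   # site-absolute path
next_url = absolute_url(BOARD_URL, '?offset=10')      # query-only link

print(detail_url)
print(next_url)
```

Inside a Scrapy callback, `response.urljoin(href)` performs the same resolution relative to the response URL, which avoids hard-coding the `https://maoyan.com` prefix.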
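The `star` and `time` fields are stored as raw page text, so some post-processing may be useful before saving the items. The following is a sketch only: it assumes the page prefixes these fields with labels such as `主演:` and `上映时间:`, which the answer above does not confirm. `clean_item` is a hypothetical helper, and the sample item is made up for illustration.

```python
import re

# Assumed label prefixes; adjust to whatever the page actually emits.
LABEL_PREFIX = re.compile(r'^\s*(主演|上映时间)\s*[::]\s*')
# Matches YYYY, YYYY-MM, or YYYY-MM-DD.
DATE = re.compile(r'\d{4}(?:-\d{2}(?:-\d{2})?)?')


def strip_label(text):
    """Drop a leading label such as '主演:' and surrounding whitespace."""
    return LABEL_PREFIX.sub('', text or '').strip()


def clean_item(item):
    """Return a copy of item with 'star' as a list of names and 'time' as a date string."""
    cleaned = dict(item)
    names = re.split(r'[,,]', strip_label(item.get('star')))
    cleaned['star'] = [name.strip() for name in names if name.strip()]
    release = strip_label(item.get('time'))
    match = DATE.search(release)
    # Fall back to the unlabelled text when no date-like pattern is found.
    cleaned['time'] = match.group(0) if match else release
    return cleaned


# Made-up sample in the assumed raw format.
sample = {
    'rank': '1',
    'title': 'Example Movie',
    'star': '\n  主演:演员甲,演员乙\n',
    'time': '上映时间:2000-01-01(示例)',
}
result = clean_item(sample)
print(result)
```

A function like this could be called in `parse_detail` just before `yield item`, or moved into a Scrapy item pipeline so that the spider itself stays focused on extraction.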