Write a crawler that meets the following requirements: 1) use Selenium to fetch pages; 2) crawl multiple pages; 3) collect the href of each data entry on every page and store the hrefs in a list; 4) crawl each href again; 5) extract the data and store it in a CSV file; 6) run the crawler in a distributed fashion.
Sure. Here is an example of a distributed crawler built with Selenium and Scrapy-Redis:
First, install the required libraries:
```
pip install scrapy scrapy-redis scrapy-selenium selenium
```
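A Redis server must be running before the crawl starts. A quick way to confirm the connection details used in settings.py below (the host, port, and password here are placeholders) is a short check with redis-py, which is installed as a dependency of scrapy-redis:

```python
# Sanity check: verify Redis is reachable before starting the crawl.
import redis

r = redis.Redis(host="127.0.0.1", port=6379, password="yourpassword")
print(r.ping())  # prints True if the connection works
```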
Next, create a Scrapy project:
```
scrapy startproject myspider
cd myspider
```
Edit settings.py and add the following:
```python
import random

# Use the Redis-based scheduler so every worker shares one request queue
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Deduplicate requests through Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep the queue in Redis so the crawl can be paused and resumed
SCHEDULER_PERSIST = True
# Schedule requests by priority
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.PriorityQueue"

# Item pipelines: write the CSV locally and also push items to Redis
ITEM_PIPELINES = {
    "myspider.pipelines.ExamplePipeline": 300,
    "scrapy_redis.pipelines.RedisPipeline": 400,
}

# Redis connection
REDIS_HOST = "127.0.0.1"
REDIS_PORT = 6379
REDIS_PARAMS = {"password": "yourpassword"}

# scrapy-selenium: browser driver and downloader middleware
SELENIUM_DRIVER_NAME = "chrome"
SELENIUM_DRIVER_EXECUTABLE_PATH = "/path/to/chromedriver"
SELENIUM_DRIVER_ARGUMENTS = ["--headless"]
DOWNLOADER_MIDDLEWARES = {
    "scrapy_selenium.SeleniumMiddleware": 800,
}

# User-Agent pool
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; AS; rv:11.0) like Gecko",
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246",
]

# Default request headers (random.choice runs only once, at settings import time;
# see the middleware sketch below for per-request rotation)
HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
    "Pragma": "no-cache",
    "Referer": "https://www.google.com/",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "cross-site",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": random.choice(USER_AGENTS),
}
```
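Note that `random.choice(USER_AGENTS)` in settings.py is evaluated only once, when the settings module is imported, so every request reuses the same User-Agent. To rotate per request, the usual approach is a small downloader middleware; a minimal sketch (the module path `myspider/middlewares.py` and the class name are illustrative, and it needs its own entry in `DOWNLOADER_MIDDLEWARES`):

```python
# myspider/middlewares.py -- sketch of a per-request User-Agent rotator.
# Assumes USER_AGENTS is defined in settings.py as shown above.
import random


class RandomUserAgentMiddleware:
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist("USER_AGENTS"))

    def process_request(self, request, spider):
        # Pick a fresh User-Agent for every outgoing request
        request.headers["User-Agent"] = random.choice(self.user_agents)
```

Enable it alongside the Selenium middleware, for example `DOWNLOADER_MIDDLEWARES = {"myspider.middlewares.RandomUserAgentMiddleware": 400, "scrapy_selenium.SeleniumMiddleware": 800}`.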
Create a spider that uses Selenium to drive the browser:
```python
import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By

from myspider.items import ExampleItem  # defined in the Item section below


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SeleniumRequest(
                url=url,
                callback=self.parse,
                headers=self.settings.get("HEADERS"),
                wait_time=10,
                screenshot=True,
                errback=self.handle_failure,
            )

    def parse(self, response):
        # Collect the href of every data entry on the current page
        href_list = response.xpath("//a/@href").getall()
        for href in href_list:
            # Each detail-page request goes through the shared Redis queue
            yield SeleniumRequest(
                url=response.urljoin(href),
                callback=self.parse_detail,
                headers=self.settings.get("HEADERS"),
                priority=1,
            )

        # Pagination: look up the "next page" link through the Selenium driver
        driver = response.request.meta["driver"]
        next_links = driver.find_elements(By.XPATH, "//a[@class='next-page']")
        if next_links:
            yield SeleniumRequest(
                url=next_links[0].get_attribute("href"),
                callback=self.parse,
                headers=self.settings.get("HEADERS"),
                wait_time=10,
                screenshot=True,
                errback=self.handle_failure,
            )

    def parse_detail(self, response):
        # Extract the data from a detail page (the XPaths are site-specific placeholders)
        item = ExampleItem()
        item["title"] = response.xpath("//h1/text()").get()
        item["content"] = " ".join(response.xpath("//p/text()").getall())
        item["url"] = response.url
        yield item

    def handle_failure(self, failure):
        # Log failed requests
        self.logger.error(failure)
```
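The spider above seeds the crawl from a hard-coded `start_urls` list, so only the node that runs it first feeds the queue. For a fully distributed start, scrapy-redis also provides `RedisSpider`, which blocks on a Redis list so any worker can pick up URLs pushed into it; a minimal sketch (the class name and `redis_key` value are illustrative):

```python
from scrapy_redis.spiders import RedisSpider


class ExampleRedisSpider(RedisSpider):
    name = "example_redis"
    # Workers block on this Redis list; the crawl starts once a URL is pushed to it
    redis_key = "example_redis:start_urls"

    def parse(self, response):
        # Same page-listing logic as ExampleSpider.parse above
        ...
```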
Create an Item to hold the extracted data:
```python
import scrapy


class ExampleItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
    url = scrapy.Field()

    def __str__(self):
        return f"{self['title']}: {self['content']}"
```
Create a pipeline that writes the data to a CSV file (it is enabled in `ITEM_PIPELINES` above):
```python
import csv


class ExamplePipeline:
    def open_spider(self, spider):
        # Each worker process writes its own local CSV file
        self.file = open("data.csv", "w", newline="", encoding="utf-8")
        self.writer = csv.writer(self.file)
        self.writer.writerow(["title", "content", "url"])

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.writer.writerow([item["title"], item["content"], item["url"]])
        return item
```
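Because each worker writes its own local data.csv, a distributed run ends up with one file per machine. The RedisPipeline enabled above also pushes a JSON-serialized copy of every item to the Redis list `<spider name>:items` (the scrapy-redis default key), so a small standalone script can merge everything into a single CSV; a sketch assuming that default key and the connection settings shown earlier:

```python
# merge_items.py -- drain the scrapy-redis item list into one merged CSV (sketch).
import csv
import json

import redis

r = redis.Redis(host="127.0.0.1", port=6379, password="yourpassword")

with open("merged.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "content", "url"])
    while True:
        raw = r.lpop("example:items")  # default key is "<spider name>:items"
        if raw is None:
            break
        item = json.loads(raw)
        writer.writerow([item.get("title"), item.get("content"), item.get("url")])
```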
Finally, start the distributed crawl from the command line:
```
scrapy crawl example -s JOBDIR=crawls/example-1
```
The `JOBDIR` option keeps local crawl state for pausing and resuming. Because the request queue itself lives in Redis (`SCHEDULER_PERSIST = True`), the same command can be run on several machines pointing at the same Redis server, and they will share the queue and crawl in parallel.
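If you use the `RedisSpider` variant sketched earlier, start `scrapy crawl example_redis` on as many machines as you like; the workers sit idle until a start URL is pushed into the shared queue, for example with a small seeding script:

```python
# seed.py -- push the first start URL into the shared Redis queue (RedisSpider variant).
import redis

r = redis.Redis(host="127.0.0.1", port=6379, password="yourpassword")
r.lpush("example_redis:start_urls", "https://www.example.com/")
```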