Python crawler: downloading images from multiple pages
A Python crawler that downloads images across multiple pages typically pairs an HTTP library (such as requests) with an HTML parser (such as BeautifulSoup) to fetch each page, locates the image links with CSS selectors or regular expressions, and then downloads the files with streaming requests (or hands the URLs to a batch tool such as wget). A brief outline of the steps:
1. Import the necessary libraries:
```python
import requests
from bs4 import BeautifulSoup
import os
from urllib.parse import urljoin
```
2. Define a function that downloads a single image, given its URL and a save path:
```python
def download_image(url, save_path):
    # Stream the response so a large image is never held in memory all at once
    response = requests.get(url, stream=True)
    if response.status_code == 200:
        with open(save_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:  # skip keep-alive chunks
                    f.write(chunk)
```
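As a quick sanity check, download_image can be called on a single image URL (the URL and filename here are only placeholders):
```python
# Placeholder URL; substitute a real image link from the target site.
download_image("http://example.com/images/sample.jpg", "sample.jpg")
```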
3. Fetch the first page and parse out its image links:
```python
def parse_first_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Resolve relative src attributes against the page URL
    image_links = [urljoin(url, img['src']) for img in soup.find_all('img', src=True)]
    return image_links

initial_url = "http://example.com/page1"
first_page_images = parse_first_page(initial_url)
```
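The overview also mentions CSS selectors; the same parse can be written with soup.select, assuming (hypothetically) that the relevant images live inside a container with class gallery:
```python
def parse_gallery_images(url):
    # ".gallery" is an assumed container class; adapt it to the site's markup
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return [urljoin(url, img['src']) for img in soup.select('.gallery img[src]')]
```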
4. Recurse (or loop) over the remaining pages and download their images:
```python
def download_all_pages(base_url, save_dir, current_page=1):
    # Assumes the site paginates with a ?page=N query parameter
    page_url = f"{base_url}?page={current_page}"
    # Download every image found on the current page
    for img_url in parse_first_page(page_url):
        filename = f"image_{current_page}_{os.path.basename(img_url)}"
        download_image(img_url, os.path.join(save_dir, filename))
    # has_next_page stands for a function that checks whether another page exists
    if has_next_page(page_url):
        download_all_pages(base_url, save_dir, current_page + 1)
```
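The sketch above leaves has_next_page undefined, just as the original outline does. A minimal implementation, assuming (hypothetically) that the site marks its pagination with a rel="next" link:
```python
def has_next_page(page_url):
    # Assumption: pagination exposes an <a rel="next"> link; adjust per site.
    # (In practice you would reuse the page already fetched by parse_first_page
    # instead of requesting it a second time.)
    response = requests.get(page_url)
    if response.status_code != 200:
        return False
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup.find('a', attrs={'rel': 'next'}) is not None
```
Putting it together (the listing URL is hypothetical):
```python
save_dir = "downloads"
os.makedirs(save_dir, exist_ok=True)
download_all_pages("http://example.com/gallery", save_dir)
```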