Batch-downloading files with a Python crawler
Batch-downloading files with a Python crawler typically involves using the requests library to fetch page content over HTTP, extracting the download links with a parsing library such as BeautifulSoup or Scrapy, and then saving each linked file to disk, with the os module handling directories and paths. An outline of the steps follows:
1. Import the required libraries:
```python
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # used below to resolve relative download links
```
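If requests and BeautifulSoup are not already available, they can be installed from PyPI with `pip install requests beautifulsoup4`; os and urllib are part of the standard library.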
2. Define a function that extracts the download links from each page you want to crawl:
```python
def get_download_links(url):
    # Fetch the page content with requests.get, then parse the HTML
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find the elements that contain download links
    download_links = soup.select('a[target="_blank"]')  # adjust the selector for the target site
    # Resolve relative hrefs against the page URL so they can be requested directly
    return [urljoin(url, link['href']) for link in download_links if link.get('href')]
```
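The `a[target="_blank"]` selector above is only a placeholder. On many sites it is more robust to select links by file extension instead; here is a minimal sketch under that assumption (the function name `get_download_links_by_extension` and the `.pdf`/`.zip` extensions are illustrative, not from the original):

```python
def get_download_links_by_extension(url, extensions=('.pdf', '.zip')):
    # Collect only links whose resolved URL ends with one of the given extensions
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    links = []
    for a in soup.find_all('a', href=True):
        href = urljoin(url, a['href'])  # resolve relative paths
        if href.lower().endswith(extensions):  # str.endswith accepts a tuple
            links.append(href)
    return links
```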
3. Create the download directory if it does not exist, then iterate over the links and download each file:
```python
base_directory = "downloads"
os.makedirs(base_directory, exist_ok=True)  # create the directory if it does not exist

urls_to_crawl = ['http://example.com/page1', 'http://example.com/page2']  # replace with the URLs you need to crawl

for url in urls_to_crawl:
    links = get_download_links(url)
    for link in links:
        filename = os.path.join(base_directory, os.path.basename(link))  # derive the local file name from the link
        with requests.get(link, stream=True) as r:
            r.raise_for_status()
            with open(filename, 'wb') as f:
                for chunk in r.iter_content(chunk_size=8192):  # download in chunks to limit memory use
                    f.write(chunk)
```
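In a real batch run, one broken link should not abort the whole job, and URLs with query strings make `os.path.basename` produce awkward names. The sketch below adds per-file error handling under those assumptions; the helper name `download_file` is my own, not part of the original:

```python
from urllib.parse import urlparse

def download_file(link, directory):
    # Strip query strings and fragments before deriving the local file name
    path = urlparse(link).path
    filename = os.path.join(directory, os.path.basename(path) or "unnamed")
    try:
        with requests.get(link, stream=True, timeout=30) as r:
            r.raise_for_status()
            with open(filename, 'wb') as f:
                for chunk in r.iter_content(chunk_size=8192):
                    f.write(chunk)
        return True
    except requests.RequestException as e:
        print(f"failed to download {link}: {e}")
        return False
```

With this helper, the inner loop in step 3 reduces to `download_file(link, base_directory)`, and a failed download is logged rather than raised.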