Scraping web page images with Python into a specified folder while capturing the page's cookies and using re.findall
Date: 2024-10-18 21:03:57
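In Python, the requests library sends HTTP requests, BeautifulSoup parses the HTML document, and the os module manages files and directories. To scrape a page's images while also capturing its cookies and using the regular-expression function `re.findall`, you can follow these steps:
1. Import the required libraries: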
```python
import requests
from bs4 import BeautifulSoup
import os
import re
```
2. Send a GET request to fetch the page content and its cookies:
```python
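def get_html_and_cookies(url):
    """Fetch a page and return its HTML together with the cookies it set."""
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    # get_dict() flattens the cookie jar into a plain {name: value} dict
    cookies = response.cookies.get_dict()
    html = response.text
    return html, cookies


url = 'TARGET_PAGE_URL'  # replace with the URL of the page to scrape
html, cookies = get_html_and_cookies(url)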
```
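Note that `response.cookies` only holds the cookies set by the final response, so cookies set during a redirect can be missed. One possible alternative, sketched below, is to route every request through a `requests.Session`, which accumulates cookies and resends them automatically. The helper names `make_session` and `fetch_with_session` are hypothetical and not part of the original steps.

```python
import requests


def make_session(user_agent='Mozilla/5.0'):
    """Create a Session that sends the same User-Agent on every request."""
    session = requests.Session()
    session.headers.update({'User-Agent': user_agent})
    return session


def fetch_with_session(session, url, timeout=10):
    """Fetch a page through the session and return (html, cookies_dict)."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    # session.cookies collects cookies from every response seen so far,
    # including those set on intermediate redirects
    return response.text, session.cookies.get_dict()
```

With this approach, later image downloads could call `session.get(link, stream=True)` and the cookies would be sent without passing `cookies=` explicitly.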
3. Parse the HTML with BeautifulSoup and collect all image links:
```python
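from urllib.parse import urljoin

soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser requires the lxml package
img_tags = soup.find_all('img')  # change the tag name to suit the page
# Skip <img> tags without a src, and resolve relative paths against the page URL
image_links = [urljoin(url, img['src']) for img in img_tags if img.get('src')]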
```
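On real pages, `src` values are often relative paths, protocol-relative URLs (`//cdn...`), or inline `data:` URIs, and none of these can be handed to `requests.get` as they stand. The sketch below shows one way to clean such a list using `urllib.parse.urljoin` from the standard library. The function name `normalize_image_links` and the example URLs are assumptions made for illustration.

```python
from urllib.parse import urljoin


def normalize_image_links(page_url, raw_links):
    """Turn raw src values into absolute http(s) URLs, dropping unusable ones."""
    absolute = []
    for src in raw_links:
        if not src or src.startswith('data:'):
            continue  # skip empty values and inline base64 images
        full = urljoin(page_url, src.strip())
        # keep only downloadable URLs, and keep each one once
        if full.startswith(('http://', 'https://')) and full not in absolute:
            absolute.append(full)
    return absolute


links = normalize_image_links(
    'https://example.com/gallery/index.html',
    ['a.png', '/static/b.jpg', '//cdn.example.com/c.gif', 'data:image/png;base64,AAAA'],
)
print(links)
```

Here `a.png` resolves against the page's directory, `/static/b.jpg` against the site root, and the `data:` entry is dropped.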
4. Download the images and save them to the specified folder:
```python
os.makedirs('images', exist_ok=True)  # create the image folder if it does not exist
for link in image_links:
    try:
        response = requests.get(link, stream=True, cookies=cookies, timeout=10)
        response.raise_for_status()  # do not save error pages as images
        filename = os.path.join('images', os.path.basename(link))  # derive the file name from the URL
        with open(filename, 'wb') as f:
            for chunk in response.iter_content(1024):
                f.write(chunk)
    except (requests.RequestException, OSError) as e:
        print(f"Failed to download image {link}: {e}")
```
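One caveat with `os.path.basename(link)`: for a URL such as `https://example.com/img/cat.png?size=large` it returns `cat.png?size=large`, and for a URL ending in `/` it returns an empty string. A minimal sketch of a more tolerant approach follows, using only the standard library. The helper `filename_from_url`, its `index` fallback, and the default `.jpg` extension are assumptions for illustration and do not cover every edge case.

```python
import os
from urllib.parse import urlparse, unquote


def filename_from_url(link, index, default_ext='.jpg'):
    """Derive a file name from the URL path, ignoring the query string."""
    path = unquote(urlparse(link).path)  # drop ?query and decode %20 etc.
    name = os.path.basename(path)
    if not name:
        name = f'image_{index}'  # fallback when the path ends in '/'
    root, ext = os.path.splitext(name)
    if not ext:
        name = root + default_ext  # assumed extension when the URL has none
    return name


print(filename_from_url('https://example.com/img/cat.png?size=large', 0))
```

In the download loop, this could replace `os.path.basename(link)` by iterating with `enumerate(image_links)` to supply `index`.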
5. Use a regular expression to extract specific information (here, assuming you want to find every email address):
```python
# [A-Za-z] rather than [A-Z|a-z]: inside a character class, | is a literal pipe
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
emails = re.findall(email_pattern, html)
print("Found emails:", emails)
```
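`re.findall` can also pull image addresses straight out of the HTML, which may be handy when BeautifulSoup is not available. When the pattern contains exactly one capture group, `re.findall` returns a list of that group's matches rather than the whole match. The pattern below is a rough sketch, not a full HTML parser: it assumes quoted `src` attributes and will miss unquoted or script-generated ones. The names `IMG_SRC_PATTERN` and `find_image_sources` are hypothetical.

```python
import re

# <img ... src="..."> with single or double quotes; \s before src avoids data-src
IMG_SRC_PATTERN = re.compile(
    r'<img\b[^>]*?\ssrc\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def find_image_sources(html):
    """Return the src value of every <img> tag that has a quoted src attribute."""
    return IMG_SRC_PATTERN.findall(html)


sample = '<img src="a.png"><IMG class="x" SRC=\'/b.jpg\' alt="b"><img alt="none">'
print(find_image_sources(sample))
```

For anything beyond simple pages, the BeautifulSoup approach in step 3 is usually the more robust choice, since regular expressions do not understand HTML nesting or comments.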