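Python dynamic web scraping: read the contents of the separate text1 and text2 files, obtain the separate Cookie1 and Cookie2 information, scrape the images from the corresponding web pages, and save them to the local files image1 and image2, with test code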
Time: 2024-11-03 20:20:31 Views: 31
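In Python, the requests library makes it easy to fetch web page content, and BeautifulSoup can parse the HTML to extract the information you need. For dynamically loaded content, you may need Selenium to simulate browser behavior. To also handle the text files (text1 and text2) and save the images locally (image1 and image2), you can proceed as follows:
First, install the required libraries (if they are not already installed):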
```bash
pip install requests beautifulsoup4 selenium pillow
```
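Before writing the script, it helps to settle how the Cookie1 and Cookie2 files are read, because requests expects its `cookies=` argument to be a dict (or a cookie jar), not a raw string. The sketch below assumes each file holds a single `name=value; name2=value2` string, as copied from a browser's Cookie request header; that format is an assumption, and `parse_cookie_string` and the sample values are hypothetical names used only for illustration.

```python
def parse_cookie_string(raw):
    """Turn 'name=value; name2=value2' into a dict usable as requests' cookies= argument."""
    cookies = {}
    for item in raw.split(";"):
        if "=" not in item:
            continue  # skip empty or malformed fragments
        # Split on the first '=' only, so values that contain '=' stay intact.
        name, value = item.split("=", 1)
        cookies[name.strip()] = value.strip()
    return cookies


# Hypothetical cookie string, standing in for the contents of Cookie1.
sample = "sessionid=abc123; theme=dark; token=a=b=="
parsed = parse_cookie_string(sample)
print(parsed)
```

If your cookie files are stored in another format (for example JSON exported by a browser extension), this parsing step would need to change accordingly.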
Then, create a Python script, for example `web_crawler.py`:
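```python
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from PIL import Image
from selenium import webdriver


def load_cookies(path):
    # Assumes the file holds a "name=value; name2=value2" string, as copied
    # from a browser's Cookie request header.
    with open(path, 'r', encoding='utf-8') as f:
        items = [item.split('=', 1) for item in f.read().split(';') if '=' in item]
    return {name.strip(): value.strip() for name, value in items}


def load_urls(path):
    # One URL per line; blank lines are skipped.
    with open(path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]


def download_image(url, file_name, cookies):
    # Download one image with the given cookies and save it to file_name.
    response = requests.get(url, cookies=cookies, timeout=10)
    if response.status_code == 200:
        with open(file_name, 'wb') as f:
            f.write(response.content)
        # Only the original bytes are kept; add any image processing here.
        with Image.open(file_name) as img:
            img.verify()  # raises if the file is not a valid image
    else:
        print(f"Failed to download {url}, status code: {response.status_code}")


# Each job pairs a URL list with its cookie file and its output directory
# (image1 and image2 are treated as directories, since a page may hold several images).
jobs = [('text1', 'Cookie1', 'image1'), ('text2', 'Cookie2', 'image2')]

# Open the browser with Selenium (assumes the Chrome driver is configured).
# The cookies are sent with the image requests only; the Selenium session itself is not logged in.
driver = webdriver.Chrome()
try:
    for text_file, cookie_file, out_dir in jobs:
        cookies = load_cookies(cookie_file)
        os.makedirs(out_dir, exist_ok=True)
        for page, url in enumerate(load_urls(text_file), start=1):
            try:
                driver.get(url)
                # Parse the rendered page and collect every <img> that has a src.
                soup = BeautifulSoup(driver.page_source, 'html.parser')
                for n, img in enumerate(soup.find_all('img', src=True), start=1):
                    image_url = urljoin(url, img['src'])
                    if image_url.startswith('http'):  # skip inline data: URIs
                        download_image(image_url, os.path.join(out_dir, f"{page}_{n}.jpg"), cookies)
            except Exception as e:
                print(f"Error processing {url}: {e}")
finally:
    driver.quit()  # close the browser once, after all URLs are processed
```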