Write a simple crawler using requests, urllib, and lxml to scrape 10 images from http://www.netbian.com/ and save them to a local folder
Posted: 2024-12-04 15:26:37
To write a simple crawler that scrapes 10 images from http://www.netbian.com/ and saves them to a local folder, we can proceed as follows:
1. Import the required libraries:
```python
import requests                   # HTTP requests
from lxml import etree            # HTML parsing
from urllib.parse import urljoin  # resolve relative image URLs
import os                         # filesystem operations
```
2. Define the target URL:
```python
url = "http://www.netbian.com/"
```
3. Set request headers (an appropriate User-Agent may be needed for this site):
```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
```
4. Send a GET request and fetch the page content:
```python
try:
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # raise on HTTP 4xx/5xx
    print("Fetched page source successfully")
except requests.exceptions.RequestException as e:
    print(f"Failed to fetch page source: {e}")
```
5. Parse the HTML and locate the image elements:
```python
tree = etree.HTML(response.text)
image_elements = tree.xpath('//img')[:10]  # take the first 10 img elements
```
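Note that the `src` attributes found this way may be relative paths rather than full URLs. Resolving them against the page URL with `urllib.parse.urljoin` (which also covers the urllib part of the task) handles both cases; a small sketch with example paths of my own:

```python
from urllib.parse import urljoin

base = "http://www.netbian.com/"

# A relative path is resolved against the base URL
print(urljoin(base, "/uploads/pic/1.jpg"))
# → http://www.netbian.com/uploads/pic/1.jpg

# An already-absolute URL is returned unchanged
print(urljoin(base, "http://img.netbian.com/a.jpg"))
# → http://img.netbian.com/a.jpg
```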
6. Save the images locally:
```python
if image_elements:
    images_folder = 'images'
    os.makedirs(images_folder, exist_ok=True)
    for i, img_element in enumerate(image_elements, start=1):  # number from 1
        # src may be a relative path, so resolve it against the page URL
        img_url = urljoin(url, img_element.attrib['src'])
        img_name = f'image_{i}.jpg'  # name by index
        file_path = os.path.join(images_folder, img_name)
        try:
            img_data = requests.get(img_url, headers=headers).content
            with open(file_path, 'wb') as f:
                f.write(img_data)
            print(f"Saved image to: {file_path}")
        except Exception as e:
            print(f"Failed to save image: {e}")
else:
    print("No image elements found")
```
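After the loop finishes, a quick sanity check can catch downloads where the server returned an error page instead of an image. A minimal sketch, assuming the `images` folder from the code above; the helper `is_jpeg` is my own:

```python
import os

def is_jpeg(data: bytes) -> bool:
    # JPEG files start with the SOI marker bytes FF D8
    return data[:2] == b'\xff\xd8'

# Report any saved file that does not look like a JPEG
if os.path.isdir('images'):
    for name in sorted(os.listdir('images')):
        path = os.path.join('images', name)
        with open(path, 'rb') as f:
            head = f.read(2)
        print(name, 'OK' if is_jpeg(head) else 'not a JPEG')
```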