Write a simple crawler that uses requests and urllib (without using os) to fetch 10 images from http://www.netbian.com/ and save them to a local folder
Posted: 2024-12-04 08:26:38
In Python, you can use `urllib.request` to make basic HTTP requests and download images, but as you noted, the `requests` library is usually more convenient. Below is a simple example that uses `requests` (with `urllib.parse` for URL resolution, so the no-`os` constraint is honored) to scrape the images from the given page and save them locally:
```python
import requests
from urllib.parse import urljoin

# Target URL
url = "http://www.netbian.com/"

# Fetch the page HTML
response = requests.get(url)
html = response.text

# Crude parse: split on '<img' (dropping everything before the first tag)
# and keep the chunks that contain a src attribute
img_chunks = [chunk for chunk in html.split('<img')[1:] if 'src="' in chunk]

# Download the first 10 images and save them to the current folder
for i, chunk in enumerate(img_chunks[:10]):
    img_url = chunk.split('src="')[1].split('"')[0]
    img_url = urljoin(url, img_url)  # resolve relative image paths
    img_name = f"image_{i}.jpg"      # assumed naming scheme
    img_data = requests.get(img_url).content
    with open(img_name, "wb") as f:
        f.write(img_data)
```
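The string-splitting approach above works for simple pages but is fragile: it can trip over single-quoted attributes or unusual markup. As a sketch using only the standard library (no third-party parser), `html.parser` can extract `src` attributes more reliably; the HTML snippet fed to it here is purely illustrative:

```python
from html.parser import HTMLParser

class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag it encounters."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)

parser = ImgSrcParser()
parser.feed('<div><img src="/a.jpg"><img alt="x" src="/b.png"></div>')
print(parser.srcs)  # ['/a.jpg', '/b.png']
```

The collected `srcs` list can then replace `img_chunks` in the loop above.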
Related questions
Write a simple crawler that uses requests, urllib, and lxml to fetch 10 images from http://www.netbian.com/ and save them to a local folder
To write a simple crawler that fetches 10 images from http://www.netbian.com/ and saves them to a local folder, you can proceed as follows:
1. Import the required libraries:
```python
import requests
from lxml import etree
import os
```
2. Define the target URL:
```python
url = "http://www.netbian.com/"
```
3. Set the request headers (an appropriate User-Agent may be needed for the site):
```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
```
4. Send a GET request and fetch the page content:
```python
try:
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    print("Fetched the page source successfully")
except requests.exceptions.RequestException as e:
    print(f"Failed to fetch the page source: {e}")
```
5. Parse the HTML content and locate the image elements:
```python
tree = etree.HTML(response.text)
image_elements = tree.xpath('//img')[:10]  # take the first 10 <img> elements
```
6. Save the images locally:
```python
if image_elements:
    images_folder = 'images'
    if not os.path.exists(images_folder):
        os.makedirs(images_folder)
    for i, img_element in enumerate(image_elements, start=1):  # number from 1
        img_url = img_element.attrib.get('src')
        if not img_url:
            continue  # skip <img> tags without a src attribute
        img_name = f'image_{i}.jpg'  # name by index
        file_path = os.path.join(images_folder, img_name)
        try:
            img_data = requests.get(img_url, headers=headers).content  # fetch first...
            with open(file_path, 'wb') as f:  # ...so a failed request leaves no empty file
                f.write(img_data)
            print(f"Saved image to: {file_path}")
        except Exception as e:
            print(f"Failed to save image: {e}")
else:
    print("No image elements found")
```
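One caveat with the loop above: the `src` values returned by the XPath query may be site-relative paths, which `requests` cannot fetch directly. A small sketch of normalizing them with the standard library's `urllib.parse.urljoin` (the paths shown are illustrative):

```python
from urllib.parse import urljoin

base = "http://www.netbian.com/"

# Relative paths are resolved against the page URL;
# absolute URLs pass through unchanged.
print(urljoin(base, "uploads/pic1.jpg"))              # http://www.netbian.com/uploads/pic1.jpg
print(urljoin(base, "/img/pic2.jpg"))                 # http://www.netbian.com/img/pic2.jpg
print(urljoin(base, "http://img.netbian.com/a.jpg"))  # http://img.netbian.com/a.jpg
```

Applying `urljoin(url, img_url)` before calling `requests.get` makes the download step work for both kinds of links.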
Write a simple crawler that uses requests, urllib, and lxml (without using os) to fetch 10 images from http://www.netbian.com/ and save them to a local folder
To write a simple crawler with Python's `requests`, `urllib.request`, and `BeautifulSoup` (using the `lxml` parser) that fetches the page and saves its images, without touching the `os` module, you can follow these steps:
1. **Install the required libraries**:
```bash
pip install requests lxml beautifulsoup4
```
2. **Import the required libraries**:
```python
import requests
from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import urljoin
from pathlib import Path
```
3. **Define the URL and download path**:
```python
url = "http://www.netbian.com/"
download_path = Path("images")      # directory for the downloaded images
download_path.mkdir(exist_ok=True)  # pathlib replaces os.makedirs, honoring the no-os constraint
```
4. **Send a GET request and parse the HTML**:
```python
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
```
5. **Find the image tags and extract their src attributes**:
```python
img_tags = soup.find_all("img")  # assumes all the images sit in <img> tags
img_urls = [urljoin(url, img["src"]) for img in img_tags if img.has_attr("src")]
```
6. **Download the images**:
```python
for i, img_url in enumerate(img_urls[:10]):  # first 10 images
    image_name = f"image_{i+1}.jpg"  # adjust the naming scheme as needed
    full_path = download_path / image_name
    try:
        with urllib.request.urlopen(img_url) as img_response:
            with open(full_path, "wb") as file:
                file.write(img_response.read())
    except Exception as e:
        print(f"Failed to download {img_url}: {e}")
```
Complete code example:
```python
import requests
from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import urljoin
from pathlib import Path

url = "http://www.netbian.com/"
download_path = Path("images")
download_path.mkdir(exist_ok=True)  # create the folder without the os module

response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
img_tags = soup.find_all("img")
img_urls = [urljoin(url, img["src"]) for img in img_tags if img.has_attr("src")]

for i, img_url in enumerate(img_urls[:10]):
    image_name = f"image_{i+1}.jpg"
    full_path = download_path / image_name
    try:
        with urllib.request.urlopen(img_url) as img_response:
            with open(full_path, "wb") as file:
                file.write(img_response.read())
    except Exception as e:
        print(f"Failed to download {img_url}: {e}")
```
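The hard-coded `.jpg` extension may not match what the server actually serves. A sketch of deriving the extension from the response's Content-Type header with the standard `mimetypes` module; the `extension_for` helper and its default are this sketch's own assumptions:

```python
import mimetypes

def extension_for(content_type):
    """Map a Content-Type header value to a file extension, defaulting to .jpg."""
    # Strip any parameters such as "; charset=..." before the lookup
    ext = mimetypes.guess_extension(content_type.split(";")[0].strip())
    return ext or ".jpg"

print(extension_for("image/png"))  # .png
print(extension_for("image/gif"))  # .gif
```

With `urllib.request`, the header value is available as `img_response.headers.get_content_type()`, so the name could become `f"image_{i+1}{extension_for(...)}"`.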