Python code to scrape 100 images each of cats and dogs from Baidu Images and save them into two folders named cat and dog
Time: 2024-06-10 08:05:07 · Views: 180
This is a typical web-crawling task. You can use Python's requests and bs4 libraries to fetch and parse the page content, and urllib to download the images. The code is as follows:
```python
import os
import urllib.error
import urllib.request

import requests
from bs4 import BeautifulSoup

# desired number of images per category
num_images = 100

# Baidu image-search URLs for each keyword
urls = {
    'cat': 'https://image.baidu.com/search/index?tn=baiduimage&word=猫',
    'dog': 'https://image.baidu.com/search/index?tn=baiduimage&word=狗',
}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/58.0.3029.110 Safari/537.36'}

for folder, url in urls.items():
    # create the target directory (cat/ or dog/) if it does not exist
    os.makedirs(folder, exist_ok=True)
    count = 0
    # send a GET request with the defined User-Agent and parse the response
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    # extract every <img> tag that carries a src attribute
    images = soup.find_all('img', {'src': True})
    for image in images:
        if count >= num_images:
            break
        try:
            # download the image into the matching folder, e.g. cat/cat0.jpg
            urllib.request.urlretrieve(
                image['src'], os.path.join(folder, f'{folder}{count}.jpg'))
            count += 1
        except (urllib.error.URLError, ValueError):
            # skip URLs that cannot be fetched (relative paths, data: URIs, ...)
            continue
    if count < num_images:
        print(f'Warning: only {count} {folder} images found on the result page')
```
The program first creates the cat and dog directories to hold the downloaded images. It then defines the number of images to fetch and the Baidu search-result URLs to visit. Using the requests library, the program sends HTTP GET requests with the specified User-Agent header and receives the responses, which it parses with BeautifulSoup. It extracts the image URLs from the HTML tags and downloads each image into the corresponding directory with Python's urllib library.
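In practice, many of the src values scraped this way are relative paths or inline data: URIs that urlretrieve cannot fetch. A small pre-filter can skip them before attempting a download; this is a sketch, and the helper name is hypothetical rather than part of the code above:

```python
from urllib.parse import urlparse

def is_downloadable_image(src):
    """Heuristic filter: keep only absolute http(s) URLs whose path looks
    like an image file, skipping data: URIs and relative paths that
    urllib.request.urlretrieve cannot fetch."""
    parsed = urlparse(src)
    if parsed.scheme not in ('http', 'https'):
        return False
    return parsed.path.lower().endswith(('.jpg', '.jpeg', '.png', '.gif'))
```

Calling this inside the download loop (`if not is_downloadable_image(image['src']): continue`) avoids wasting attempts on unusable sources.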
Note that, because of Baidu's anti-scraping measures, this code may be blocked and return an error message. It is therefore advisable to run the program only at reasonable time intervals, so as not to trigger the anti-scraping mechanism and suffer unnecessary consequences.
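One simple way to keep request intervals reasonable is to sleep for a randomized delay between successive page fetches. A minimal sketch follows; the helper name `polite_delays` is an assumption for illustration, not part of the original code:

```python
import random
import time

def polite_delays(n, base=1.0, jitter=0.5):
    """Return n randomized delays in seconds: `base` plus up to `jitter`
    extra, so requests are not sent at a perfectly regular rhythm."""
    return [base + random.uniform(0, jitter) for _ in range(n)]

# usage: sleep between successive requests in the scraping loop
# for delay in polite_delays(num_pages):
#     response = requests.get(url, headers=headers)
#     time.sleep(delay)
```

Randomizing the delay makes the traffic pattern look less mechanical than a fixed sleep, which is a common courtesy measure when scraping.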