Python code for scraping images and text from a web page
Date: 2024-11-06 21:20:31 Views: 60
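In Python, you can scrape the images and text on a web page with libraries such as BeautifulSoup, requests, and PIL (Pillow). A simple example follows:
First install the required libraries with pip, if they are not already installed: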
```bash
pip install beautifulsoup4 requests pillow lxml
```
Then write the scraper script:
```python
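import base64
import io
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from PIL import Image

# Fetch the page content
url = "http://example.com"  # replace with the site you want to scrape
response = requests.get(url, timeout=10)
response.raise_for_status()
html_content = response.text

# Parse the HTML with BeautifulSoup (the 'lxml' parser needs the lxml package)
soup = BeautifulSoup(html_content, 'lxml')

# Find all image tags
img_tags = soup.find_all('img')
for index, img in enumerate(img_tags):
    img_url = img.get('src')  # image link; some <img> tags have no src
    if not img_url:
        continue
    if img_url.startswith('data:'):
        # Inline data URI: decode the payload directly (assumes base64 encoding)
        img_data = base64.b64decode(img_url.split(',', 1)[1])
    else:
        # Resolve relative links against the page URL, then download
        img_data = requests.get(urljoin(url, img_url), timeout=10).content
    try:
        # Let Pillow identify the real format so the extension matches the data
        ext = Image.open(io.BytesIO(img_data)).format.lower()
    except OSError:
        ext = 'jpg'  # unrecognized by Pillow; keep the original default
    # Name the file after the alt text; the index keeps names unique
    alt = ''.join(c if c.isalnum() else '_' for c in img.get('alt', ''))
    img_name = f'image_{index}_{alt}.{ext}'
    with open(img_name, 'wb') as f:
        f.write(img_data)

# Find and extract the text
text = soup.get_text()
print("Extracted text:")
print(text)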
```
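The text returned by `soup.get_text()` usually carries the page's indentation and long runs of blank lines, so it is often worth tidying before printing or saving it. The following is a minimal sketch of one way to do that with the standard library only. The helper name `clean_text` and the sample string are hypothetical, introduced here for illustration, and are not part of the script above.

```python
def clean_text(raw_text):
    """Strip surrounding whitespace from each line and drop empty lines."""
    # splitlines() handles \n, \r\n and other line boundaries
    lines = (line.strip() for line in raw_text.splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical sample resembling raw get_text() output
sample = "\n\n  Example Domain \n\n\n   This domain is for use in examples.  \n"
print(clean_text(sample))
```

In the scraper script, this could be applied as `text = clean_text(soup.get_text())`. It only normalizes whitespace and does not try to filter out navigation menus or other page furniture.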