Scraping the text and images from a web page with Python, with runnable example code
In Python, you can use the BeautifulSoup library to parse an HTML document and extract its text, and the requests library to download the images. The following is a simple example showing how to crawl a web page:
First, make sure the `requests` and `beautifulsoup4` libraries are installed:
```bash
pip install requests beautifulsoup4
```
Then write a function that fetches the page, extracts the text, and collects the image URLs:
```python
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl_webpage(url):
    # Send the HTTP request
    response = requests.get(url)
    # Check whether the request succeeded
    if response.status_code == 200:
        # Parse the HTML
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract the page text
        article_text = soup.get_text()
        # Collect the src attribute of every <img> tag as an image URL,
        # resolving relative paths against the page URL
        img_tags = soup.find_all('img')
        image_urls = [urljoin(url, img['src']) for img in img_tags if 'src' in img.attrs]
        # Download the images (a basic example; adjust the save path as needed)
        download_images(image_urls, save_dir='./images')
        print(f"Article text:\n{article_text[:100] + '...' if len(article_text) > 100 else article_text}")
        print("Images downloaded to ./images directory.")
    else:
        print(f"Failed to fetch the webpage with status code {response.status_code}")


def download_images(urls, save_dir='.', filename_prefix=''):
    # Create the output directory if it does not exist
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
    for i, url in enumerate(urls):
        response = requests.get(url)
        if response.status_code == 200:
            file_name = f"{filename_prefix}image_{i}.jpg"
            with open(os.path.join(save_dir, file_name), 'wb') as f:
                f.write(response.content)
        else:
            print(f"Failed to download image {url}, got status code {response.status_code}")


# Use the functions
target_url = "https://example.com"  # Replace with the page you want to crawl
crawl_webpage(target_url)
```
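In practice, `soup.get_text()` also returns navigation menus, script contents, and other boilerplate, and some sites reject requests that lack a browser-like `User-Agent` header. Below is a minimal sketch of one way to narrow extraction to paragraph text and save it to a file; the `extract_paragraphs` helper, the header value, and the assumption that the main text lives in `<p>` tags are illustrative choices, not part of the example above, and may need adjusting per site.

```python
import requests
from bs4 import BeautifulSoup


def extract_paragraphs(url):
    # Assumed header: some sites block the default requests User-Agent
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Assumption: the article body is contained in <p> tags
    paragraphs = [p.get_text(strip=True) for p in soup.find_all('p')]
    return "\n".join(p for p in paragraphs if p)


if __name__ == "__main__":
    text = extract_paragraphs("https://example.com")  # replace with your target page
    with open("article.txt", "w", encoding="utf-8") as f:
        f.write(text)
```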