Scraping the text and images from a web page with Python, with runnable example code
In Python, you can use the BeautifulSoup library to parse an HTML document and extract its text, and the requests library to download the images. The following is a simple example showing how to crawl a web page:
First, make sure the `requests` and `beautifulsoup4` libraries are installed:
```bash
pip install requests beautifulsoup4
```
Then write a function that fetches the page, extracts the text, and collects the image URLs:
```python
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl_webpage(url):
    # Send the HTTP request
    response = requests.get(url)
    # Check whether the request succeeded
    if response.status_code == 200:
        # Parse the HTML
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract the page text
        article_text = soup.get_text()
        # Collect the src attribute of every <img> tag as an image URL,
        # resolving relative paths against the page URL
        img_tags = soup.find_all('img')
        image_urls = [urljoin(url, img['src']) for img in img_tags if 'src' in img.attrs]
        # Download the images (a basic example; adjust the save path as needed)
        download_images(image_urls, save_dir='./images')
        print(f"Article text:\n{article_text[:100] + '...' if len(article_text) > 100 else article_text}")
        print("Images downloaded to ./images directory.")
    else:
        print(f"Failed to fetch the webpage with status code {response.status_code}")


def download_images(urls, save_dir='.', filename_prefix=''):
    # Create the output directory if it does not exist
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
    for i, url in enumerate(urls):
        response = requests.get(url)
        if response.status_code == 200:
            file_name = f"{filename_prefix}image_{i}.jpg"
            with open(os.path.join(save_dir, file_name), 'wb') as f:
                f.write(response.content)
        else:
            print(f"Failed to download image {url}, got status code {response.status_code}")


# Use the functions
target_url = "https://example.com"  # Replace with the page you want to crawl
crawl_webpage(target_url)
```
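In practice, `soup.get_text()` also returns navigation menus, script contents, and other boilerplate, and some sites reject requests that lack a browser-like `User-Agent` header. Below is a minimal sketch of one way to narrow extraction to paragraph text and save it to a file; the `extract_paragraphs` helper, the header value, and the assumption that the main text lives in `<p>` tags are illustrative choices, not part of the example above, and may need adjusting per site.

```python
import requests
from bs4 import BeautifulSoup


def extract_paragraphs(url):
    # Assumed header: some sites block the default requests User-Agent
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Assumption: the article body is contained in <p> tags
    paragraphs = [p.get_text(strip=True) for p in soup.find_all('p')]
    return "\n".join(p for p in paragraphs if p)


if __name__ == "__main__":
    text = extract_paragraphs("https://example.com")  # replace with your target page
    with open("article.txt", "w", encoding="utf-8") as f:
        f.write(text)
```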