Fetching a web page and its subpages with Python and saving the HTML
In Python you can use the `requests` library to send HTTP requests and fetch page content, then parse the HTML with the `BeautifulSoup` library. To crawl subpages as well, combine this with recursion or a depth-first search (DFS). The basic steps are:
1. Install the required libraries:
```bash
pip install requests beautifulsoup4
```
2. Fetch the page content with `requests`:
```python
import requests

def get_html(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        print(f"Failed to fetch URL: {url}")
        return None
```
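In practice, a plain `requests.get` can hang on slow servers, and some sites reject clients that send no browser-like `User-Agent`. A slightly more robust variant (a sketch; the header string and the 10-second timeout are arbitrary choices, not requirements):
```python
import requests

def get_html_robust(url, timeout=10):
    # Some sites block the default requests User-Agent string
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; MyCrawler/1.0)'}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # raise on 4xx/5xx status codes
        response.encoding = response.apparent_encoding  # handle non-UTF-8 pages
        return response.text
    except requests.RequestException as e:
        print(f"Failed to fetch URL: {url} ({e})")
        return None
```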
3. Parse the HTML with `BeautifulSoup`:
```python
from bs4 import BeautifulSoup

def parse_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    # Extract whatever information you need here,
    # typically with methods such as select() or find_all()
    return soup  # return the soup object so callers can keep querying it
```
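As a concrete example of that extraction step, here is a small sketch using `find_all` and the `.title` attribute (the helper names are made up for illustration):
```python
def extract_links(soup):
    # Collect the href attribute of every <a> tag that has one
    return [a['href'] for a in soup.find_all('a', href=True)]

def extract_title(soup):
    # soup.title is None when the page has no <title> tag
    return soup.title.string if soup.title else ''
```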
4. Crawl subpages (recursive example):
```python
from urllib.parse import urljoin

def crawl_subpages(url, depth=1):
    html = get_html(url)
    if html is not None:
        main_soup = parse_html(html)
        # Find all links (a basic example; see the is_valid_url sketch below)
        for link in main_soup.find_all('a'):
            href = link.get('href')
            if href and is_valid_url(href):  # your own check for valid links
                sub_url = urljoin(url, href)  # resolves relative links correctly
                if depth > 0:
                    crawl_subpages(sub_url, depth - 1)
        # Save this page; derive a unique filename per URL so that
        # recursive calls don't all overwrite the same file
        filename = url.replace('://', '_').replace('/', '_') + '.html'
        save_to_file(main_soup, filename)

# Entry point: crawl at most 2 levels of links below the start page
crawl_subpages('http://example.com', depth=2)
```
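`is_valid_url` is a filter you define yourself; what counts as "valid" depends on your crawl. A minimal illustrative sketch that skips empty links, page anchors, and `mailto:`/`javascript:` pseudo-links:
```python
from urllib.parse import urlparse

def is_valid_url(href):
    # Purely illustrative: accept relative links and http(s) links only
    if not href or href.startswith(('#', 'mailto:', 'javascript:')):
        return False
    return urlparse(href).scheme in ('', 'http', 'https')
```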
5. Save the HTML to a file:
```python
def save_to_file(soup, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(soup.prettify())  # serialize the soup object back to HTML text
```
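One caveat with the recursive approach: pages frequently link back to each other, so without remembering which URLs have already been fetched the crawler can revisit pages or recurse forever. A hedged sketch that adds a visited set on top of the helpers above (the function name is illustrative):
```python
from urllib.parse import urljoin

visited = set()

def crawl_with_dedup(url, depth=2):
    # Skip URLs we've already fetched so mutual links don't cause loops
    if url in visited:
        return
    visited.add(url)
    html = get_html(url)
    if html is None:
        return
    soup = parse_html(html)
    save_to_file(soup, url.replace('://', '_').replace('/', '_') + '.html')
    if depth > 0:
        for a in soup.find_all('a', href=True):
            if is_valid_url(a['href']):
                crawl_with_dedup(urljoin(url, a['href']), depth - 1)
```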