Fetching a web page and its subpages with Python and saving the HTML
In Python you can use the `requests` library to send HTTP requests and fetch page content, then parse the HTML with the `BeautifulSoup` library. To crawl subpages as well, combine this with recursion or a depth-first search (DFS). The basic steps are:
1. Install the required libraries:
```bash
pip install requests beautifulsoup4
```
2. Fetch the page content with `requests`:
```python
import requests

def get_html(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        print(f"Failed to fetch URL: {url}")
        return None
```
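In practice, a plain `requests.get` can hang on slow servers, and some sites reject clients that send no browser-like `User-Agent`. A slightly more robust variant (a sketch; the header string and the 10-second timeout are arbitrary choices, not requirements):
```python
import requests

def get_html_robust(url, timeout=10):
    # Some sites block the default requests User-Agent string
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; MyCrawler/1.0)'}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # raise on 4xx/5xx status codes
        response.encoding = response.apparent_encoding  # handle non-UTF-8 pages
        return response.text
    except requests.RequestException as e:
        print(f"Failed to fetch URL: {url} ({e})")
        return None
```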
3. Parse the HTML with `BeautifulSoup`:
```python
from bs4 import BeautifulSoup

def parse_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    # Extract whatever information you need here,
    # typically with methods such as select() or find_all()
    return soup  # return the soup object so callers can keep querying it
```
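As a concrete example of that extraction step, here is a small sketch using `find_all` and the `.title` attribute (the helper names are made up for illustration):
```python
def extract_links(soup):
    # Collect the href attribute of every <a> tag that has one
    return [a['href'] for a in soup.find_all('a', href=True)]

def extract_title(soup):
    # soup.title is None when the page has no <title> tag
    return soup.title.string if soup.title else ''
```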
4. Crawl subpages (recursive example):
```python
from urllib.parse import urljoin

def crawl_subpages(url, depth=1):
    html = get_html(url)
    if html is not None:
        main_soup = parse_html(html)
        # Find all links (a basic example; see the is_valid_url sketch below)
        for link in main_soup.find_all('a'):
            href = link.get('href')
            if href and is_valid_url(href):  # your own check for valid links
                sub_url = urljoin(url, href)  # resolves relative links correctly
                if depth > 0:
                    crawl_subpages(sub_url, depth - 1)
        # Save this page; derive a unique filename per URL so that
        # recursive calls don't all overwrite the same file
        filename = url.replace('://', '_').replace('/', '_') + '.html'
        save_to_file(main_soup, filename)

# Entry point: crawl at most 2 levels of links below the start page
crawl_subpages('http://example.com', depth=2)
```
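`is_valid_url` is a filter you define yourself; what counts as "valid" depends on your crawl. A minimal illustrative sketch that skips empty links, page anchors, and `mailto:`/`javascript:` pseudo-links:
```python
from urllib.parse import urlparse

def is_valid_url(href):
    # Purely illustrative: accept relative links and http(s) links only
    if not href or href.startswith(('#', 'mailto:', 'javascript:')):
        return False
    return urlparse(href).scheme in ('', 'http', 'https')
```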
5. Save the HTML to a file:
```python
def save_to_file(soup, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(soup.prettify())  # serialize the soup object back to HTML text
```
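One caveat with the recursive approach: pages frequently link back to each other, so without remembering which URLs have already been fetched the crawler can revisit pages or recurse forever. A hedged sketch that adds a visited set on top of the helpers above (the function name is illustrative):
```python
from urllib.parse import urljoin

visited = set()

def crawl_with_dedup(url, depth=2):
    # Skip URLs we've already fetched so mutual links don't cause loops
    if url in visited:
        return
    visited.add(url)
    html = get_html(url)
    if html is None:
        return
    soup = parse_html(html)
    save_to_file(soup, url.replace('://', '_').replace('/', '_') + '.html')
    if depth > 0:
        for a in soup.find_all('a', href=True):
            if is_valid_url(a['href']):
                crawl_with_dedup(urljoin(url, a['href']), depth - 1)
```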