
Write a 150-line crawler to scrape a web page

Time: 2024-12-31 16:48:31 Views: 11

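A basic Python crawler takes a few dozen lines once you account for exception handling, loops, and data structures. Below is a simple example that uses the `requests` library to send HTTP requests, the `BeautifulSoup` library to parse HTML, and the `time` library to space out requests:

```python
import time

import requests
from bs4 import BeautifulSoup

# Target URLs; add more entries to crawl several pages
urls = ['https://example.com']


def get_html(url):
    """Fetch a page and return its HTML, or None if the request fails."""
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()  # raises HTTPError on 4xx/5xx responses
        return response.text
    except (requests.RequestException, ValueError) as e:
        print(f"Fetch failed: {e}")
        return None


def parse_content(html):
    """Extract the text of every <p> tag on the page."""
    soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser needs the lxml package
    return [p.text for p in soup.find_all('p')]


def save_content(data):
    """Append each extracted item to output.txt, one item per line."""
    with open('output.txt', 'a', encoding='utf-8') as f:
        for item in data:
            f.write(item + '\n')


# Main program: fetch each URL once, pausing between requests
if __name__ == '__main__':
    for url in urls:
        html = get_html(url)
        if html is not None:
            save_content(parse_content(html))
        time.sleep(1)  # pause 1 second after each request
```

Note that this is only a basic example. A real project may need more sophisticated error handling, pagination, login support, or Selenium to handle dynamically loaded content. Also, always comply with the site's Robots protocol (`robots.txt`) and avoid putting excessive load on the target site.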
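The advice about the Robots protocol can be made concrete with the standard library's `urllib.robotparser`. The following is a minimal sketch, not part of the original answer: the `SAMPLE_ROBOTS` rules, the `example-crawler` user-agent name, and the helper `build_robots_checker` are all assumptions made for illustration. It parses the rules from a string so that it runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a real download
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

# Hypothetical user-agent name for this crawler
AGENT = "example-crawler"


def build_robots_checker(robots_txt):
    """Build a RobotFileParser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


checker = build_robots_checker(SAMPLE_ROBOTS)

# Ask whether this agent may fetch a given URL under the sample rules
allowed_public = checker.can_fetch(AGENT, "https://example.com/index.html")
allowed_private = checker.can_fetch(AGENT, "https://example.com/private/data.html")

# Crawl-delay value for this agent, or None if the file does not set one
delay = checker.crawl_delay(AGENT)

print(allowed_public, allowed_private, delay)
```

Against a live site, one would likely call `set_url()` with the site's `robots.txt` address followed by `read()` instead of `parse()`, skip any URL for which `can_fetch()` returns `False`, and sleep for at least the reported crawl delay between requests.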
