Write a Python web crawler framework
Date: 2024-02-16 16:52:09
Sure, here is sample code for a simple Python crawler framework:
```python
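import requests
from bs4 import BeautifulSoup


class Spider:
    def __init__(self, url, max_depth=2):
        self.url = url              # start URL
        self.max_depth = max_depth  # recursion bound (an addition, so the crawl terminates)
        self.visited = set()        # URLs that have already been requested

    def get_links(self, html):
        # Collect absolute http(s) links that have not been visited yet.
        soup = BeautifulSoup(html, 'html.parser')
        links = []
        for link in soup.find_all('a'):
            href = link.get('href')
            if href and href.startswith('http') and href not in self.visited:
                links.append(href)
        return links

    def crawl(self, url=None, depth=0):
        url = url or self.url
        self.visited.add(url)
        try:
            # A timeout keeps one unresponsive server from stalling the crawl.
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException as exc:
            print(f'Skipping {url}: {exc}')
            return
        for link in self.get_links(response.text):
            print(link)
            # Re-check: a deeper recursive call may already have visited this link.
            if link not in self.visited and depth < self.max_depth:
                self.crawl(link, depth + 1)


if __name__ == '__main__':
    spider = Spider('https://www.example.com/')
    spider.crawl()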
```
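This crawler framework uses the requests and BeautifulSoup libraries to fetch pages and parse the links they contain. It follows links recursively and keeps a set of visited URLs so that the same page is not fetched twice. You can adapt it as needed, for example by adding features or changing the link-matching rule.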
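As one possible way to change the link-matching rule, the sketch below resolves relative links and keeps only URLs on the starting host. The function name `filter_links` and its parameters are hypothetical, not part of the framework above. It uses only the standard library `urllib.parse` module and takes raw `href` values, so it does not depend on how the HTML was parsed.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def filter_links(hrefs, base_url, visited=()):
    """Return same-host absolute URLs from raw href values, in input order."""
    base_host = urlparse(base_url).netloc
    result = []
    for href in hrefs:
        if not href:
            continue  # <a> tags without an href yield None
        # Resolve relative hrefs against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ('http', 'https'):
            continue  # skips mailto:, javascript:, and similar schemes
        if parsed.netloc != base_host:
            continue  # stay on the starting host
        if absolute in visited or absolute in result:
            continue  # already fetched, or duplicated on this page
        result.append(absolute)
    return result
```

Inside `get_links`, the `href` values collected from `soup.find_all('a')` could be passed to this function together with the URL of the page being parsed and `self.visited`. Staying on one host is a design choice that keeps the crawl bounded; remove the `netloc` check if following external links is intended.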