A coding framework for a multithreaded web crawler
Posted: 2023-11-28 15:05:27
Here is a skeleton for a multithreaded crawler:
```python
import threading
import queue
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

class Crawler:
    def __init__(self, start_url, num_threads):
        self.start_url = start_url
        self.num_threads = num_threads
        self.queue = queue.Queue()
        self.visited = set()
        self.lock = threading.Lock()

    def enqueue(self, url):
        self.queue.put(url)

    def dequeue(self):
        return self.queue.get()

    def is_visited(self, url):
        # Check and mark atomically so two threads never crawl the same URL.
        with self.lock:
            if url in self.visited:
                return True
            self.visited.add(url)
            return False

    def crawl(self):
        self.enqueue(self.start_url)
        for _ in range(self.num_threads):
            t = threading.Thread(target=self.worker)
            t.daemon = True
            t.start()
        self.queue.join()  # blocks until every queued URL has been processed

    def worker(self):
        while True:
            url = self.dequeue()
            try:
                if self.is_visited(url):
                    continue
                print("Crawling: " + url)
                response = requests.get(url, timeout=10)
                soup = BeautifulSoup(response.content, "html.parser")
                # process the page here
                # ...
                # add links to the queue (skip empty hrefs, resolve relative URLs)
                for link in soup.find_all("a"):
                    href = link.get("href")
                    if href:
                        self.enqueue(urljoin(url, href))
            except Exception as e:
                print("Error: " + str(e))
            finally:
                self.queue.task_done()  # required, or queue.join() never returns

if __name__ == "__main__":
    crawler = Crawler("https://www.example.com", 10)
    crawler.crawl()
```
This framework uses a queue to hold links waiting to be processed and a set to record links that have already been visited. When a link is taken off the queue, it is first checked against the visited set: if it has been seen before, it is skipped; otherwise it is marked as visited and fetched. After a page is fetched, all links on it are extracted and added to the queue for later processing. Multiple worker threads process links in parallel, which speeds up the crawl.
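The key detail above is that checking and marking the visited set happen atomically under one lock; otherwise two threads could both see a URL as "new" and crawl it twice. A minimal standalone sketch of that check-and-mark pattern (the `VisitedSet` class and URLs here are illustrative, not part of the framework):

```python
import threading

class VisitedSet:
    """Thread-safe check-and-mark set: claims a URL for exactly one caller."""
    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def mark_if_new(self, url):
        # Check membership and insert under one lock so no two threads
        # can both observe the URL as unseen.
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True

# Eight threads race to claim the same URL; exactly one should win.
vs = VisitedSet()
wins = []

def worker():
    if vs.mark_if_new("https://www.example.com"):
        wins.append(1)

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(wins))  # prints 1
```

Doing the check and the `add` as two separate locked operations would reintroduce the race; the atomicity of the combined operation is what makes deduplication safe.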