Crawling data from multiple websites concurrently in Python
Date: 2023-10-06 11:05:54
To crawl data from multiple websites at the same time, you can use either multithreading or asynchronous I/O.
1. Multithreading: use Python's `threading` module to create one thread per website, each responsible for crawling one site. Create thread objects with `threading.Thread`, and use a `threading.Lock` to make writes to shared state thread-safe. Example:
```python
import threading
import requests

results = {}                     # shared store: url -> page text
results_lock = threading.Lock()  # protects concurrent writes to results

class Crawler(threading.Thread):
    def __init__(self, url):
        super().__init__()
        self.url = url

    def run(self):
        response = requests.get(self.url, timeout=10)
        with results_lock:       # only one thread writes at a time
            results[self.url] = response.text
```
Create a Crawler object for each URL, start all the threads, then wait for them to finish:
```python
urls = ['http://www.example.com', 'http://www.example2.com', 'http://www.example3.com']
crawlers = [Crawler(url) for url in urls]
for crawler in crawlers:
    crawler.start()
for crawler in crawlers:
    crawler.join()
```
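As an alternative to subclassing `Thread`, the standard library's `concurrent.futures.ThreadPoolExecutor` manages the thread pool and result collection for you. A minimal sketch follows; the `fetch` function here is a stand-in (an assumption, not from the original) that in real use would call `requests.get(url).text`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # placeholder for requests.get(url, timeout=10).text
    return f"<html>content of {url}</html>"

urls = ['http://www.example.com', 'http://www.example2.com', 'http://www.example3.com']

with ThreadPoolExecutor(max_workers=3) as pool:
    # submit one task per URL; as_completed yields futures as they finish
    futures = {pool.submit(fetch, url): url for url in urls}
    results = {futures[f]: f.result() for f in as_completed(futures)}

print(len(results))  # one entry per URL
```

With an executor there is no need for a manual lock: each `Future` holds its own result, and the `with` block joins all workers on exit.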
2. Asynchronous I/O: use Python's `asyncio` module together with `aiohttp`; `asyncio.gather` runs multiple coroutines concurrently. Example:
```python
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = ['http://www.example.com', 'http://www.example2.com', 'http://www.example3.com']
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)  # run all fetches concurrently
        print(results)

asyncio.run(main())
```
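When the URL list is large, firing off every request at once can overwhelm the target sites or exhaust local connections. A common adjustment (not shown in the original) is to cap concurrency with `asyncio.Semaphore`. A sketch follows, with `asyncio.sleep` standing in for the real `aiohttp` request:

```python
import asyncio

async def fetch(url):
    # placeholder for an aiohttp request; simulates network I/O
    await asyncio.sleep(0.01)
    return f"data from {url}"

async def bounded_fetch(sem, url):
    async with sem:              # at most N fetches in flight at once
        return await fetch(url)

async def main():
    urls = [f"http://www.example{i}.com" for i in range(10)]
    sem = asyncio.Semaphore(3)   # allow 3 concurrent fetches
    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded_fetch(sem, u) for u in urls))

results = asyncio.run(main())
print(len(results))
```

The semaphore limit (3 here) is an arbitrary example value; tune it to what the target servers and your bandwidth can tolerate.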
The above are simple examples of the two approaches; a real crawler will also need error handling, timeouts, and rate limiting adapted to the sites being crawled.