Python multithreaded Douban scraper example
Date: 2023-12-03 16:49:03
Sure, here is a Python crawler example that uses multiple threads to scrape book information from Douban:
```python
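import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

# Douban book listing for the "编程" (programming) tag; `start` is the paging offset.
url_template = 'https://book.douban.com/tag/编程?start={}&type=T'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}


def get_book_list(start):
    """Download one listing page and return its book entries."""
    url = url_template.format(start)
    # A timeout keeps a stalled connection from blocking a worker thread forever.
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    soup = BeautifulSoup(response.text, 'html.parser')
    book_list = soup.find_all('li', class_='subject-item')
    return book_list


def get_book_info(book):
    """Extract the title and rating from one book entry."""
    title = book.find('div', class_='info').a.get_text().strip()
    # Some entries may lack a rating span; fall back to 'N/A'
    # instead of raising AttributeError on None.
    rating_tag = book.find('span', class_='rating_nums')
    rating = rating_tag.get_text().strip() if rating_tag else 'N/A'
    return title, rating


if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=10) as executor:
        # Stage 1: fetch the five listing pages (start = 0, 20, ..., 80) concurrently.
        futures = []
        for start in range(0, 100, 20):
            futures.append(executor.submit(get_book_list, start))
        books = []
        for future in futures:
            books.extend(future.result())
        # Stage 2: extract title and rating from each entry in the pool.
        futures = []
        for book in books:
            futures.append(executor.submit(get_book_info, book))
        for future in futures:
            title, rating = future.result()
            print(title, rating)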
```
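The code above uses ThreadPoolExecutor from the concurrent.futures library to crawl with multiple threads. In the first loop, executor.submit hands get_book_list to the thread pool once per listing page, and each returned future object is saved in a list. In the second loop, executor.submit hands get_book_info to the pool once per book entry, and the futures are collected the same way. Finally, calling future.result() on each future blocks until that task has finished and returns its result, and the book information is printed. Because the futures are read in the order they were submitted, the output follows page order regardless of which task finishes first.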
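The submit/result pattern can be tried without touching the network. The following is a minimal sketch, not part of the original answer: `fake_fetch_page` is a hypothetical stand-in for get_book_list that sleeps instead of making an HTTP request, and the two helper functions are illustrative names. It contrasts reading futures in submission order, as the answer does, with `as_completed`, which yields each future as soon as it finishes.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch_page(start):
    """Hypothetical stand-in for get_book_list: sleeps instead of doing HTTP."""
    # Later pages "respond" faster, so tasks tend to finish out of submission order.
    time.sleep((100 - start) / 1000)
    return [f"book-{start + i}" for i in range(3)]


def fetch_in_submission_order(starts, max_workers=5):
    """Submit every page, then read results in the order they were submitted."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fake_fetch_page, s) for s in starts]
        books = []
        for future in futures:
            books.extend(future.result())  # blocks until this task is done
    return books


def fetch_as_completed(starts, max_workers=5):
    """Submit every page, then record each page offset as its task finishes."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_start = {executor.submit(fake_fetch_page, s): s for s in starts}
        finished = []
        for future in as_completed(future_to_start):
            future.result()  # re-raises any exception from the worker
            finished.append(future_to_start[future])
    return finished


if __name__ == "__main__":
    starts = range(0, 100, 20)
    print(fetch_in_submission_order(starts))
    print(fetch_as_completed(starts))
```

The first helper always returns books in page order. The order returned by the second depends on thread timing, which is why `as_completed` suits cases where each result can be processed independently as it arrives.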