Writing a multithreaded crawler for the Qidian novel site (起点小说网)
Date: 2023-06-01 13:03:27 Views: 220
The following Python code is a multithreaded crawler for the Qidian novel site:
```python
import queue
import threading
from urllib.parse import urljoin

import requests
from lxml import etree


class Spider:
    """Fetches the novel list page and individual novel detail pages."""

    def __init__(self, url, headers):
        self.url = url
        self.session = requests.Session()
        self.session.headers.update(headers)

    def _fetch_html(self, url):
        # Download a page and parse it into an lxml element tree.
        response = self.session.get(url, timeout=10)
        response.raise_for_status()
        return etree.HTML(response.text)

    def get_novel_list(self):
        # Collect the detail-page links and resolve relative or
        # protocol-relative hrefs against the list page URL.
        html = self._fetch_html(self.url)
        hrefs = html.xpath('//div[@class="book-mid-info"]/h4/a/@href')
        return [urljoin(self.url, href) for href in hrefs]

    def get_novel_info(self, novel_url):
        # The XPath selectors depend on Qidian's page markup and may
        # need adjusting if the layout changes.
        html = self._fetch_html(novel_url)
        base = '//div[@class="book-info "]'

        def first(path):
            # Return the first match, or None when nothing matches.
            result = html.xpath(path)
            return result[0].strip() if result else None

        return {
            'title': first(base + '/h1/em/text()'),
            'author': first(base + '/h1/span/a/text()'),
            'intro': first('//div[@class="book-intro"]/p/text()'),
            'word_count': first(base + '/p/span[1]/text()'),
        }


class SpiderThread(threading.Thread):
    """Worker thread that takes novel URLs from a shared queue."""

    def __init__(self, spider, novel_queue):
        super().__init__()
        self.spider = spider
        self.novel_queue = novel_queue

    def run(self):
        while True:
            try:
                novel_url = self.novel_queue.get(block=False)
            except queue.Empty:
                break  # The queue is drained, so this worker is done.
            try:
                print(self.spider.get_novel_info(novel_url))
            except requests.RequestException as exc:
                # Skip a failed page instead of stopping the worker.
                print(f'Failed to fetch {novel_url}: {exc}')


def main():
    url = 'https://www.qidian.com/all'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    spider = Spider(url, headers)

    # Fill the queue with every novel URL before starting the workers.
    novel_queue = queue.Queue()
    for novel_url in spider.get_novel_list():
        novel_queue.put(novel_url)

    # Start five worker threads.
    threads = []
    for _ in range(5):
        spider_thread = SpiderThread(spider, novel_queue)
        spider_thread.start()
        threads.append(spider_thread)

    # Wait for all workers to finish.
    for t in threads:
        t.join()


if __name__ == '__main__':
    main()
```
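The code uses the requests library to download pages and lxml to parse them, and it uses multiple threads to speed up crawling. A `Spider` class fetches the novel list and the details of each novel, a `SpiderThread` class serves as the worker thread, and the main function builds the queue of novel URLs, starts the worker threads, and waits for all of them to finish.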
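The queue-and-worker pattern described above can be tried without touching the network. The following is a minimal sketch, not the original author's code: `fake_fetch` is a hypothetical stand-in for `Spider.get_novel_info`, `crawl` and `worker` are illustrative names, and the `example.com` URLs are placeholders.

```python
import queue
import threading


def fake_fetch(url):
    # Hypothetical stand-in for Spider.get_novel_info: no network access.
    return {'url': url, 'title': 'novel-' + url.rsplit('/', 1)[-1]}


def worker(tasks, results, lock):
    # Take URLs from the shared queue until it is empty, then exit.
    while True:
        try:
            url = tasks.get(block=False)
        except queue.Empty:
            return
        info = fake_fetch(url)
        with lock:  # Guard the shared result list explicitly.
            results.append(info)


def crawl(urls, num_threads=5):
    # Queue every URL first, then let the workers drain the queue.
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(tasks, results, lock))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == '__main__':
    # Placeholder URLs for illustration only.
    urls = [f'https://example.com/book/{i}' for i in range(10)]
    infos = crawl(urls)
    print(len(infos))
```

Because every URL is queued before the workers start, an empty queue reliably means there is no work left. If the workers also added new URLs while running, pairing `Queue.task_done()` with `Queue.join()` would likely be a safer way to detect completion.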