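Write a multithreaded crawler (at least 4 threads, using XPath) that scrapes information on all books under the 杂文 (essays) tag (at least the first 10 pages), including title, author, publisher, publication date, page count, price, ISBN, Douban rating, number of ratings, book cover, and URL, and saves the results sorted by Douban rating in descending order (file named after the tag)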
Posted: 2023-10-20 14:01:50
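Sure, I'll answer this question. First, we need to understand how a multithreaded crawler in Python basically works.
The basic flow of a multithreaded crawler in Python is:
1. Create a pool of worker threads, with at least 4 threads.
2. Distribute the tasks among the threads in the pool.
3. Each thread runs its own crawl over its assigned pages, independently of the others.
4. The scraped data is saved locally or to a database.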
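The flow above can be sketched with the standard library's `ThreadPoolExecutor`. This is only a minimal illustration, not the code used in the steps below: `fetch_page` and `crawl` are hypothetical names, and `fetch_page` returns made-up rows instead of touching the network so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_page(page):
    """Stand-in for a real page fetch; returns fake rows so the sketch runs offline."""
    return [f"page{page}-book{n}" for n in range(3)]


def crawl(pages, workers=4):
    # map() hands each page to an idle worker and yields results in input order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(fetch_page, pages)
        # Flatten the per-page lists into one list, ready to be saved
        return [row for rows in results for row in rows]


rows = crawl(range(10))
print(len(rows))  # 30
```

Collecting the rows in the main thread and writing them once afterwards avoids having several threads write to the same file at the same time.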
Next, we can implement the code following this flow. The steps are:
1. Import the necessary libraries
```python
import requests
from lxml import etree
import threading
import os
import csv
```
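2. Define the crawler function
```python
# One lock shared by all threads so CSV rows are never interleaved
csv_lock = threading.Lock()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}


def first(node, path):
    """Return the first XPath match as stripped text, or '' when nothing matches."""
    found = node.xpath(path)
    return found[0].strip() if found else ''


def spider(start_page, end_page):
    url = 'https://book.douban.com/tag/%E6%9D%82%E6%96%87?start={}&type=T'
    for i in range(start_page, end_page):
        # Each list page shows 20 books, so page i starts at offset i * 20
        response = requests.get(url.format(i * 20), headers=headers, timeout=10)
        html = etree.HTML(response.text)
        rows = []
        for item in html.xpath('//li[@class="subject-item"]'):
            book_title = first(item, './/h2/a/text()')
            book_url = first(item, './/h2/a/@href')
            book_rate = first(item, './/span[@class="rating_nums"]/text()')  # '' if unrated
            book_people = first(item, './/span[@class="pl"]/text()')
            book_cover = first(item, './/img/@src')
            # Pub line: "author / [translator /] publisher / date / price"
            info = [s.strip() for s in first(item, './/div[@class="pub"]/text()').split('/')]
            if len(info) < 4:
                info = [info[0], '', '', '']
            book_author, book_publisher, book_date, book_price = info[0], info[-3], info[-2], info[-1]
            # Page count and ISBN are not on the list page, so read them from the
            # detail page's info block (layout assumed; verify against the live page)
            book_pages = book_isbn = ''
            if book_url:
                detail = etree.HTML(requests.get(book_url, headers=headers, timeout=10).text)
                info_path = '//div[@id="info"]/span[contains(text(), "{}")]/following-sibling::text()[1]'
                book_pages = first(detail, info_path.format('页数'))
                book_isbn = first(detail, info_path.format('ISBN'))
            rows.append([book_title, book_author, book_publisher, book_date, book_pages, book_price, book_isbn, book_rate, book_people, book_cover, book_url])
        # Append the whole page in one locked write
        with csv_lock, open('book.csv', 'a', newline='', encoding='utf-8-sig') as f:
            csv.writer(f).writerows(rows)
```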
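The publisher line on the list page is a single string separated by `/`, and the number of fields varies (for example when a translator is listed), which is why the author is taken from the front and the publisher, date, and price from the back. Below is a small offline sketch of that idea. `parse_pub_info` is a hypothetical helper, and the sample strings are made up for illustration rather than taken from the site.

```python
def parse_pub_info(pub):
    """Split an 'author / ... / publisher / date / price' string into named fields."""
    parts = [p.strip() for p in pub.split('/')]
    parts = [p for p in parts if p]
    if len(parts) < 4:
        # Too few fields to assign positions reliably; keep only the first one
        return {'author': parts[0] if parts else '', 'publisher': '', 'date': '', 'price': ''}
    return {
        'author': parts[0],       # first field
        'publisher': parts[-3],   # counted from the end, so extra
        'date': parts[-2],        # middle fields (e.g. a translator)
        'price': parts[-1],       # do not shift these three
    }


print(parse_pub_info(' 作者甲 / 译者乙 / 某出版社 / 2007-4 / 18.00元 '))
```

Indexing from the end keeps the last three fields stable, but a line that omits the price or date would still be misread, so results are worth spot-checking.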
3. Define the thread function
```python
def thread_func(start_page, end_page):
    # Each thread crawls its own half-open page range [start_page, end_page)
    spider(start_page, end_page)
```
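4. Distribute the tasks among the threads
```python
threads = []
# Pages 0-9 are split into 5 ranges of 2 pages each, one thread per range
for i in range(0, 10, 2):
    t = threading.Thread(target=thread_func, args=(i, i + 2))
    threads.append(t)
for t in threads:
    t.start()
# Wait for every thread to finish before sorting the results
for t in threads:
    t.join()
```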
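If the page count is not an even multiple of the thread count, the ranges can be computed instead of hard-coded. The sketch below shows one way to do that: `split_pages` and `work` are hypothetical names, and `work` only records page numbers so the example runs without network access.

```python
import threading


def split_pages(total_pages, n_threads):
    """Return n_threads half-open (start, end) ranges covering 0..total_pages."""
    base, extra = divmod(total_pages, n_threads)
    ranges, start = [], 0
    for i in range(n_threads):
        # The first `extra` threads take one additional page each
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


visited = []
lock = threading.Lock()


def work(start, end):
    # Stand-in for the real crawl: just record which pages this thread owns
    with lock:
        visited.extend(range(start, end))


threads = [threading.Thread(target=work, args=r) for r in split_pages(10, 4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(split_pages(10, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
print(sorted(visited))     # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```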
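5. Sort the scraped data by Douban rating in descending order and save it
```python
import pandas as pd
# The file was written as utf-8-sig, so read it back the same way to drop the BOM
df = pd.read_csv('book.csv', header=None, names=['book_title', 'book_author', 'book_publisher', 'book_date', 'book_pages', 'book_price', 'book_isbn', 'book_rate', 'book_people', 'book_cover', 'book_url'], encoding='utf-8-sig')
# Unrated books have no rating; coerce to NaN so the sort stays numeric (NaN sorts last)
df['book_rate'] = pd.to_numeric(df['book_rate'], errors='coerce')
df_sorted = df.sort_values(by=['book_rate'], ascending=False)
df_sorted.to_csv('杂文.csv', index=False, encoding='utf-8-sig')
os.remove('book.csv')
```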
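Books with too few ratings have an empty rating field, so the sort has to decide where they go. As a rough illustration of that rule without pandas, here is a standard-library sketch. `sort_by_rating` is a hypothetical helper and the rows are made-up sample data; `rate_index=7` matches the column order used above, where the rating is the eighth field.

```python
def sort_by_rating(rows, rate_index=7):
    """Sort rows by numeric rating, highest first; unrated rows go last."""
    def key(row):
        try:
            return float(row[rate_index])
        except (ValueError, IndexError):
            # Empty or malformed rating: treat as lower than any real score
            return float('-inf')
    return sorted(rows, key=key, reverse=True)


sample = [['a', '7.9'], ['b', ''], ['c', '9.1']]
print(sort_by_rating(sample, rate_index=1))
# [['c', '9.1'], ['a', '7.9'], ['b', '']]
```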