Using Python, write a multithreaded crawler that scrapes the first 10 pages of books under the Travel Writing (游记) tag on Douban Books, collecting each book's title, author, publisher, publication date, page count, price, ISBN, Douban rating, number of ratings, cover image, and URL. Use at least 4 threads, do not use Scrapy, parse the pages with XPath, and save the results to a CSV (named after the tag) sorted by Douban rating in descending order. The code should be concise, commented, and print output.
Posted: 2023-10-20 22:04:07 · Views: 111
A Douban Books crawler written in Python
Sure, I understand your question. First, we need to install a few libraries: requests, lxml, and pandas (threading is in the standard library):
```python
import requests            # HTTP requests
from lxml import etree     # XPath parsing of the HTML
import pandas as pd        # tabular data + CSV export
import threading           # multithreaded crawling
```
Then we can define a function that fetches the data from one page:
```python
import re  # used below to pull page count / ISBN out of the detail page

def get_data(page):
    """Scrape one list page (20 books) of the 游记 tag."""
    url = f'https://book.douban.com/tag/%E6%B8%B8%E8%AE%B0?start={page*20}&type=T'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/58.0.3029.110 Safari/537.3'}
    html = etree.HTML(requests.get(url, headers=headers).text)
    data = []
    for book in html.xpath('//ul[@class="subject-list"]/li'):
        title = book.xpath('.//h2/a/@title')[0].strip()
        link = book.xpath('.//h2/a/@href')[0]
        cover = book.xpath('.//img/@src')[0]
        # div.pub holds ONE slash-separated string:
        # "author / publisher / pub date / price"
        parts = [p.strip() for p in
                 book.xpath('.//div[@class="pub"]/text()')[0].split('/')]
        if len(parts) < 4:  # some entries omit a field; pad from the left
            parts = [''] * (4 - len(parts)) + parts
        price, pub_date, publisher = parts[-1], parts[-2], parts[-3]
        author = ' / '.join(parts[:-3])  # author names may themselves contain '/'
        # books with too few votes have no rating element
        rating = (book.xpath('.//span[@class="rating_nums"]/text()') or ['0'])[0]
        ratings = book.xpath('.//span[@class="pl"]/text()')[0].strip()
        # page count and ISBN only appear on the book's detail page
        info = ''.join(etree.HTML(requests.get(link, headers=headers).text)
                       .xpath('//div[@id="info"]//text()'))
        pages_m = re.search(r'页数[:：]\s*(\d+)', info)
        isbn_m = re.search(r'ISBN[:：]\s*(\S+)', info)
        data.append([title, author, publisher, pub_date,
                     pages_m.group(1) if pages_m else '',
                     price, isbn_m.group(1) if isbn_m else '',
                     rating, ratings, cover, link])
    return data
```
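One quirk of the Douban list page worth spelling out: the author, publisher, publication date, and price are not separate elements but a single slash-separated string inside `div.pub`, so the individual fields have to be recovered by splitting. A minimal sketch of that split (the sample string below is illustrative, not scraped):

```python
# A typical div.pub string from the list page (illustrative sample)
pub = "[美] 比尔·布莱森 / 接力出版社 / 2019-6 / 52.00元"

# Split on the slash and strip surrounding whitespace
parts = [p.strip() for p in pub.split('/')]

# By convention the last three fields are price, date, and publisher;
# whatever is left over (possibly containing '/') is the author
price, pub_date, publisher = parts[-1], parts[-2], parts[-3]
author = ' / '.join(parts[:-3])

print(author, publisher, pub_date, price)
# → [美] 比尔·布莱森 接力出版社 2019-6 52.00元
```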
Next, we can use multiple threads to speed up the crawl:
```python
def crawl(pages):
    data = []
    threads = []
    # one thread per page (10 pages -> 10 threads, above the 4-thread minimum);
    # list.extend is atomic under the GIL, so the shared list needs no lock here
    for page in range(pages):
        t = threading.Thread(target=lambda p: data.extend(get_data(p)), args=(page,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    df = pd.DataFrame(data, columns=['书名', '作者', '出版社', '出版日期', '页数',
                                     '价格', 'ISBN', '豆瓣评分', '评价人数',
                                     '封面链接', '书籍链接'])
    # ratings were scraped as strings; convert so the sort is numeric, not lexicographic
    df['豆瓣评分'] = pd.to_numeric(df['豆瓣评分'], errors='coerce')
    df.sort_values(by='豆瓣评分', ascending=False, inplace=True)
    # utf-8-sig so Excel recognizes the Chinese headers
    df.to_csv('游记.csv', index=False, encoding='utf-8-sig')
    print(f'Scraped {len(df)} books; saved to 游记.csv')
```
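One subtlety: everything scraped out of HTML is a string, and pandas sorts strings lexicographically, so a text-typed rating column would rank '9.9' above '10.0'. Converting the column with `pd.to_numeric` before sorting gives the intended order:

```python
import pandas as pd

df = pd.DataFrame({'书名': ['a', 'b', 'c'],
                   '豆瓣评分': ['9.9', '10.0', '8.5']})

# Lexicographic sort on the string column: '9.9' > '8.5' > '10.0'
print(list(df.sort_values(by='豆瓣评分', ascending=False)['书名']))  # ['a', 'c', 'b']

# Numeric sort after conversion: 10.0 > 9.9 > 8.5
df['豆瓣评分'] = pd.to_numeric(df['豆瓣评分'], errors='coerce')
print(list(df.sort_values(by='豆瓣评分', ascending=False)['书名']))  # ['b', 'a', 'c']
```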
Finally, call the crawl function with the number of pages to scrape:
```python
crawl(10)
```
With that, the program scrapes the book information from the first 10 pages of the travel-writing (游记) tag on Douban Books and saves it, sorted by Douban rating in descending order, to a file named 游记.csv.
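As a side note, if managing Thread objects by hand feels verbose, the standard library's concurrent.futures offers the same fan-out/collect pattern at a higher level. A sketch using a stand-in fetch function (not the real get_data) to illustrate the idea:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(page):
    # Stand-in for get_data(page): returns a list of rows for that page
    return [f'row-{page}-{i}' for i in range(2)]

# map() runs fetch on 4 worker threads and yields results in page order
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, range(10)))

# Flatten the per-page lists into one dataset
data = [row for page_rows in results for row in page_rows]
print(len(data))  # 20
```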