Can't scrape the sales figures or publication dates

```
import requests
from bs4 import BeautifulSoup
from threading import Thread

def crawl_books(start, end):
    for i in range(start, end):
        url = 'http://search.dangdang.com/?key=%BC%C6%CB%E3%BB%FA&act=input&page_index={}'.format(i)
        response = requests.get(url)
        page = response.text
        soup = BeautifulSoup(page, 'lxml')
        books = soup.find('ul', class_='bigimg')
        for book in books.find_all('li'):
            title = book.find('a', class_='pic').get('title')
            author = book.find('p', class_='search_book_author').text
            price = book.find('p', class_='price').find('span', class_='search_now_price').text
            sales_tag = book.find('span', class_='search_sales')
            sales = sales_tag.text if sales_tag else 'no sales info'
            date_tag = book.find('span', class_='search_book_publishtime')
            date = date_tag.text if date_tag else 'no publication date'
            print(title, author, price, sales, date)

threads = []
for i in range(1, 101, 10):
    t = Thread(target=crawl_books, args=(i, i+10))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
```
Posted: 2024-02-29 14:55:18 · Views: 15
The way your code retrieves the sales and publication-date information is correct, but any of the following could explain why nothing comes back:
1. The page simply does not contain tags for sales or publication-date information. Inspect the page source manually to confirm whether the tags exist.
2. The page loads part of its content dynamically, so some information is not present in the initial HTML response. Consider using a tool such as Selenium to simulate user actions, trigger the dynamic loading, and then scrape the rendered page.
3. The site applies anti-scraping measures, so the information cannot be retrieved normally. You can try countermeasures such as proxy IPs or modified request headers.
Check the cases above carefully. If the problem persists, please share more details and the page URL, and we will do our best to help.
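As a starting point for point 3, here is a minimal sketch of building the request with a rotated User-Agent header and an optional proxy. The User-Agent strings and the proxy address are placeholder values, not anything the site requires; note also that Dangdang percent-encodes the search keyword in GBK rather than UTF-8, which is why the `key` parameter in your URL looks the way it does:

```python
import random
from urllib.parse import quote

# Placeholder desktop User-Agent strings; substitute real, current ones.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
]

def build_search_url(keyword, page):
    # Dangdang encodes the keyword in GBK, not UTF-8:
    # quote('计算机', encoding='gbk') -> '%BC%C6%CB%E3%BB%FA'
    key = quote(keyword, encoding='gbk')
    return 'http://search.dangdang.com/?key={}&act=input&page_index={}'.format(key, page)

def build_request_kwargs(proxy=None):
    # Pick a User-Agent at random and optionally route through a proxy.
    kwargs = {
        'headers': {'User-Agent': random.choice(USER_AGENTS)},
        'timeout': 10,
    }
    if proxy:
        kwargs['proxies'] = {'http': proxy, 'https': proxy}
    return kwargs
```

You would then call, for example, `requests.get(build_search_url('计算机', 1), **build_request_kwargs())` in place of the bare `requests.get(url)`.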
Related questions
Write code that also scrapes each book's review count

```
import requests
from bs4 import BeautifulSoup
from threading import Thread

def crawl_books(start, end):
    for i in range(start, end):
        url = 'http://search.dangdang.com/?key=%BC%C6%CB%E3%BB%FA&act=input&page_index={}'.format(i)
        response = requests.get(url)
        page = response.text
        soup = BeautifulSoup(page, 'lxml')
        books = soup.find('ul', class_='bigimg')
        for book in books.find_all('li'):
            title = book.find('a', class_='pic').get('title')
            author = book.find('p', class_='search_book_author').text
            price = book.find('p', class_='price').find('span', class_='search_now_price').text
            print(title, author, price)

threads = []
for i in range(1, 101, 10):
    t = Thread(target=crawl_books, args=(i, i+10))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
```
The code above only scrapes the title, author, and price of books matching the search keyword encoded in the URL; it does not fetch review counts. To scrape the review count, first collect each book's detail-page link, then visit the detail page and extract the count there. A sample implementation:
```
import requests
from bs4 import BeautifulSoup
from threading import Thread

def get_book_links(start, end):
    links = []
    for i in range(start, end):
        url = 'http://search.dangdang.com/?key=%BC%C6%CB%E3%BB%FA&act=input&page_index={}'.format(i)
        response = requests.get(url)
        page = response.text
        soup = BeautifulSoup(page, 'lxml')
        books = soup.find('ul', class_='bigimg')
        for book in books.find_all('li'):
            link = book.find('a', class_='pic').get('href')
            links.append(link)
    return links

def get_comment_count(url):
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page, 'lxml')
    count_tag = soup.find('a', class_='comm_num')
    count = count_tag.text if count_tag else 'no review count'
    return count

def crawl_books(start, end):
    links = get_book_links(start, end)
    for link in links:
        count = get_comment_count(link)
        print(link, count)

threads = []
for i in range(1, 101, 10):
    t = Thread(target=crawl_books, args=(i, i+10))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
```
The code above adds two new functions: `get_book_links` and `get_comment_count`. `get_book_links` collects the detail-page link of every book on a range of search-result pages, and `get_comment_count` extracts the review count from a detail page. In `crawl_books`, we first gather the book links from each result page, then call `get_comment_count` for each link and print the result. Finally, multiple ranges of result pages are crawled concurrently with threads.
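Note that `get_comment_count` returns the raw label text. If the detail page displays the count as text like `'1234条评论'` (an assumption about the page format, which you should verify against the actual HTML), a small helper can turn it into a number:

```python
import re

def parse_comment_count(text):
    """Extract the first run of digits from a review-count label.

    Assumes text such as '1234条评论'; returns 0 when no digits are found.
    Commas are stripped first in case the site groups thousands.
    """
    match = re.search(r'\d+', text.replace(',', ''))
    return int(match.group()) if match else 0
```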
Add review-count scraping to this code

```
import requests
from bs4 import BeautifulSoup
from threading import Thread

def crawl_books(start, end):
    for i in range(start, end):
        url = 'http://search.dangdang.com/?key=%BF%C6%BB%C3%D0%A1%CB%B5&act=input&page_index=1&sort_type=sort_default#J_tab'.format(i)
        response = requests.get(url)
        page = response.text
        soup = BeautifulSoup(page, 'lxml')
        books = soup.find('ul', class_='bigimg')
        for book in books.find_all('li'):
            title = book.find('a', class_='pic').get('title')
            author = book.find('p', class_='search_book_author').text
            price = book.find('p', class_='price').find('span', class_='search_now_price').text
            print(title, author, price)

threads = []
for i in range(1, 101, 10):
    t = Thread(target=crawl_books, args=(i, i+10))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
```
To add review-count scraping, first locate the element that holds the review count on the page, then find that element inside the loop and extract its text. Note also that your URL hardcodes `page_index=1` while `.format(i)` has no `{}` placeholder to fill, so every thread fetches the same page. Here is a modified version:
```
import requests
from bs4 import BeautifulSoup
from threading import Thread

def crawl_books(start, end):
    for i in range(start, end):
        # page_index is now a placeholder, so each iteration fetches a different page
        url = 'http://search.dangdang.com/?key=%BF%C6%BB%C3%D0%A1%CB%B5&act=input&page_index={}&sort_type=sort_default#J_tab'.format(i)
        response = requests.get(url)
        page = response.text
        soup = BeautifulSoup(page, 'lxml')
        books = soup.find('ul', class_='bigimg')
        for book in books.find_all('li'):
            title = book.find('a', class_='pic').get('title')
            author = book.find('p', class_='search_book_author').text
            price = book.find('p', class_='price').find('span', class_='search_now_price').text
            comment = book.find('p', class_='search_star_line').find_all('a')[2].text  # locate the review-count element and extract its text
            print(title, author, price, comment)

threads = []
for i in range(1, 101, 10):
    t = Thread(target=crawl_books, args=(i, i+10))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
```
In this example, `book.find('p', class_='search_star_line').find_all('a')[2].text` locates the review-count element and extracts its text. You may need to adjust the selector to the site's actual structure.
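One caveat: a chained expression like `find(...).find_all('a')[2]` raises an `AttributeError` or `IndexError` the moment one listing lacks the expected markup, killing the whole thread. A defensive sketch (the helper names are my own, not part of BeautifulSoup) that degrades to a default value instead:

```python
def safe_text(tag, default='N/A'):
    """Return a tag's stripped text, or a default when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else default

def nth_link_text(container, n, default='N/A'):
    """Return the text of the n-th <a> inside container, guarding against
    a missing container and against link lists shorter than expected."""
    if container is None:
        return default
    links = container.find_all('a')
    return links[n].get_text(strip=True) if len(links) > n else default
```

Inside the loop you could then write `comment = nth_link_text(book.find('p', class_='search_star_line'), 2)` and still get a row for listings without a review count.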