A complete Python scraping/analysis script with four goals: 1. get the titles of the "new book recommendation" books; 2. get the book IDs; 3. get the book cover-image URLs; 4. store the data.
Posted: 2024-05-02 19:21:34
Since the actual website and its data structure are unknown, the code below is an illustrative example of scraping a "new book recommendation" page, for reference only.
```python
import requests
from bs4 import BeautifulSoup
import re
import json

# Fetch the page source
url = "http://www.example.com/new_books"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
html = response.text

# Parse the page source
soup = BeautifulSoup(html, "html.parser")

# Get the "new book recommendation" titles
book_names = []
book_name_tags = soup.select("div.new-book > h2 > a")
for tag in book_name_tags:
    book_names.append(tag.text.strip())

# Get the book IDs from the link URLs
book_ids = []
book_id_tags = soup.select("div.new-book > h2 > a")
for tag in book_id_tags:
    match = re.findall(r'/book/(\d+)/', tag.get("href", ""))
    # Guard against links that do not match the expected pattern
    book_ids.append(match[0] if match else None)

# Get the cover-image URLs
book_photos = []
book_photo_tags = soup.select("div.new-book > div.book-cover > img")
for tag in book_photo_tags:
    book_photos.append(tag.get("src"))

# Store the data
book_list = []
for i in range(len(book_names)):
    book_dict = {"book_name": book_names[i],
                 "book_id": book_ids[i],
                 "book_photo": book_photos[i]}
    book_list.append(book_dict)

with open("new_books.json", "w", encoding="utf-8") as f:
    json.dump(book_list, f, ensure_ascii=False)
```
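The ID-extraction step hinges entirely on the URL pattern `/book/(\d+)/`; that pattern can be sanity-checked in isolation with the standard library alone (the sample hrefs below are made up for illustration):

```python
import re

# Hypothetical hrefs in the shape the selector above would return
hrefs = ["/book/12345/", "/book/678/detail", "/news/42/"]

book_ids = []
for href in hrefs:
    match = re.findall(r'/book/(\d+)/', href)
    # Only keep an ID when the href actually matches the pattern
    book_ids.append(match[0] if match else None)

print(book_ids)  # → ['12345', '678', None]
```

Note that an href like `/news/42/` yields `None` rather than raising an `IndexError`, which is why the guard on the empty match list matters.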
Notes:
1. The `requests` library fetches the page source;
2. `BeautifulSoup` parses the HTML;
3. `re` extracts the book ID from each link URL;
4. `json` writes the collected data to a JSON-formatted file;
5. The `headers` parameter imitates a browser request;
6. The `select()` method picks out the HTML tags matching a CSS selector;
7. The example uses three separate loops for titles, IDs, and image URLs; if all three values can be read from the same tag, a single loop is enough;
8. The final result is stored as a list, where each element is a dictionary containing a title, an ID, and an image URL.
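The single-loop variant mentioned in note 7 can be sketched as follows, assuming each `div.new-book` block contains both the title link and the cover image (the HTML string here is a made-up fixture, not the real site):

```python
import re
import json
from bs4 import BeautifulSoup

# Made-up HTML fixture mirroring the selectors used in the example
html = """
<div class="new-book">
  <h2><a href="/book/101/">Book One</a></h2>
  <div class="book-cover"><img src="/img/101.jpg"></div>
</div>
<div class="new-book">
  <h2><a href="/book/202/">Book Two</a></h2>
  <div class="book-cover"><img src="/img/202.jpg"></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
book_list = []
# One pass per book block, pulling all three fields together
for block in soup.select("div.new-book"):
    link = block.select_one("h2 > a")
    img = block.select_one("div.book-cover > img")
    match = re.findall(r'/book/(\d+)/', link.get("href", ""))
    book_list.append({
        "book_name": link.text.strip(),
        "book_id": match[0] if match else None,
        "book_photo": img.get("src") if img else None,
    })

with open("new_books.json", "w", encoding="utf-8") as f:
    json.dump(book_list, f, ensure_ascii=False)
```

Besides being shorter, the single loop keeps the three fields of one book together, so a missing cover image on one entry cannot shift the lists out of alignment the way three parallel loops can.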