Send requests with the `requests` library and parse the Douguo food site (url: https://www.douguo.com/jingxuan/0), pages 1 to 6, with either XPath or bs4 (choose one); extract the title, detail-page link, author, like count, and favorite count, and save the results to an Excel file.
Time: 2024-11-24 11:41:37  Views: 7
First, I'll show you how to use Python's `requests` library to send HTTP requests and parse the returned HTML with `BeautifulSoup`. Of the two options, we choose `BeautifulSoup` here and locate the data with CSS selectors; an XPath version (e.g. with `lxml`) would work just as well.
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Listing base URL; the URL in the question suggests pages are numbered from 0
base_url = "https://www.douguo.com/jingxuan/"
# Many sites reject requests without a browser-like User-Agent
headers = {"User-Agent": "Mozilla/5.0"}

def get_data(url):
    """Fetch one listing page and return aligned rows of recipe fields."""
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # The CSS selectors below come from the original answer; verify them
    # against the live page markup before relying on the results
    titles = soup.select('.list-title a')
    title_list = [t.get_text(strip=True) for t in titles]
    detail_links = [t.get('href', '') for t in titles]
    authors = [a.get_text(strip=True) for a in soup.select('.author')]
    likes = [n.get_text(strip=True) for n in soup.select('.heart_num')]
    favorites = [n.get_text(strip=True) for n in soup.select('.collect_num')]
    # zip keeps each row's fields together and truncates to the shortest list,
    # so every title stays paired with its own author and counts
    return list(zip(title_list, detail_links, authors, likes, favorites))

# Collect rows from six pages (0-5, matching the question's pages 1~6)
rows = []
for page in range(6):
    rows.extend(get_data(base_url + str(page)))

# One row per recipe, one column per field
all_data = pd.DataFrame(rows, columns=['标题', '详情页链接', '作者', '点赞数', '收藏数'])

# Save to an Excel file (requires the openpyxl package)
all_data.to_excel('豆果美食数据.xlsx', index=False)
```
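The question also allows an XPath-based parse. Below is a minimal sketch of that alternative using `lxml`, run against an illustrative inline HTML snippet; the class names and structure are assumptions for demonstration, not the real douguo.com markup.

```python
from lxml import html

# Illustrative stand-in for a listing page (assumed structure)
sample = """
<ul>
  <li class="list-item">
    <p class="list-title"><a href="/cookbook/1.html">红烧肉</a></p>
    <span class="author">小明</span>
    <span class="heart_num">12</span>
    <span class="collect_num">34</span>
  </li>
</ul>
"""

tree = html.fromstring(sample)
# Each expression returns a parallel list; zip pairs them up per recipe
titles = tree.xpath('//p[@class="list-title"]/a/text()')
links = tree.xpath('//p[@class="list-title"]/a/@href')
authors = tree.xpath('//span[@class="author"]/text()')
likes = tree.xpath('//span[@class="heart_num"]/text()')
favorites = tree.xpath('//span[@class="collect_num"]/text()')
rows = list(zip(titles, links, authors, likes, favorites))
print(rows)  # [('红烧肉', '/cookbook/1.html', '小明', '12', '34')]
```

Against the live site you would pass `response.text` to `html.fromstring` instead of the sample string, then build the DataFrame from `rows` exactly as in the bs4 version.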