Use the `requests` library and XPath to scrape the '百度热搜' (Baidu Hot Search) ranking at https://top.baidu.com/board?tab=realtime, extracting each entry's title, content, image link, and hot-search index, and save the results in .csv/.txt format.
Date: 2024-10-16 21:06:05
Python's `requests` library can send HTTP requests to fetch page content, and the `lxml` library (built on `libxml2` and `libxslt`) supports XPath syntax for extracting data from an HTML document. Here is an example of the basic steps:
1. First, install the required libraries if they are not already present, via pip:
```
pip install requests lxml
```
2. Then write a Python script to complete the task:
```python
import requests
from lxml import etree
import csv

# Send a GET request (a User-Agent header helps avoid being served a bot page)
url = "https://top.baidu.com/board?tab=realtime"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)

# Check whether the request succeeded
if response.status_code == 200:
    # Parse the HTML response with lxml so XPath queries can be run on it
    tree = etree.HTML(response.text)

    # Use XPath to locate the data (the class names below are illustrative;
    # inspect the live page to confirm the actual structure)
    titles = tree.xpath('//div[@class="hot-item-title"]//a/text()')
    contents = tree.xpath('//div[@class="hot-item-title"]//span[@class="hot-item-content"]/text()')
    img_links = tree.xpath('//div[@class="hot-item-title"]//img/@src')
    # Hot-search index: assumed here to sit in a "heat-score" attribute
    # on each item element
    heat_indices = [el.get('heat-score') for el in tree.xpath('//div[@class="hot-item-title"]')]

    # Store the data in a CSV file
    with open('titles_contents.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Title', 'Content', 'Image link', 'Heat index'])
        for title, content, img_link, heat_index in zip(titles, contents, img_links, heat_indices):
            writer.writerow([title, content, img_link, heat_index])

    # The data can also be stored in a TXT file, one record per block
    with open('titles_contents.txt', 'w', encoding='utf-8') as f:
        for title, content, img_link, heat_index in zip(titles, contents, img_links, heat_indices):
            f.write(f'Title: {title}\nContent: {content}\nImage link: {img_link}\nHeat index: {heat_index}\n\n')
else:
    print("Request failed, status code:", response.status_code)
```
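Before running the script against the live site, the XPath expressions can be sanity-checked offline against a small HTML fragment. The snippet below is a minimal sketch: the `div` structure and the `heat-score` attribute are assumptions mirroring the selectors above, not Baidu's actual markup.

```python
from lxml import etree

# A tiny inline fragment mimicking the structure the XPath queries assume,
# so the extraction pattern can be verified without a network request
html = """
<div class="hot-item-title" heat-score="4960000">
  <a>Example headline</a>
  <span class="hot-item-content">Example summary text.</span>
  <img src="https://example.com/pic.jpg"/>
</div>
"""

tree = etree.HTML(html)
titles = tree.xpath('//div[@class="hot-item-title"]//a/text()')
contents = tree.xpath('//div[@class="hot-item-title"]//span[@class="hot-item-content"]/text()')
img_links = tree.xpath('//div[@class="hot-item-title"]//img/@src')
heat = [el.get('heat-score') for el in tree.xpath('//div[@class="hot-item-title"]')]

print(titles)     # ['Example headline']
print(img_links)  # ['https://example.com/pic.jpg']
print(heat)       # ['4960000']
```

If the real page turns out to render these items with JavaScript, the XPath queries will return empty lists, which is a quick signal to inspect the raw response or switch to the page's data API.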