Using regular expressions and file operations to scrape and save a post about Yichang (宜昌) from Baidu Tieba

Time: 2024-12-28 19:38:04 Views: 17

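In Python, you can use the `re` module for regular expressions, `os` for file and directory operations, `urllib` for URL handling, and the `requests` library to fetch page content. The basic steps are:

1. **Install the required library**: first make sure `requests` is installed; if it is not, install it with `pip install requests`.
2. **Write the script**:

```python
import os
import re
from urllib.parse import urljoin

import requests

# Target URL; replace with the actual search keyword and paging parameter.
url = "https://tieba.baidu.com/search?q=宜昌&ie=utf-8&pn="

# Create the directory for downloaded posts once, if it does not exist yet.
post_dir = "posts"
os.makedirs(post_dir, exist_ok=True)

# Walk the result pages; assuming 50 results per page, fetch the first few pages.
for page in range(1, 6):
    response = requests.get(url + str(page), timeout=10)
    html_content = response.text

    # Find links to posts whose link text contains the keyword "宜昌".
    # [^"]* and [^<]* keep each match inside a single <a> tag.
    post_links = re.findall(
        r'<a href="(/?thread[^"]*)">[^<]*宜昌[^<]*</a>', html_content, re.IGNORECASE
    )

    # Download each post and save it as an HTML file.
    for link in post_links:
        # urljoin adds the "/" that a relative link such as "thread..." lacks.
        post_url = urljoin("https://tieba.baidu.com/", link)
        post_response = requests.get(post_url, timeout=10)
        with open(os.path.join(post_dir, f'{link.split("/")[-1]}.html'), "wb") as file:
            file.write(post_response.content)
```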
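The link-extraction and file-saving steps of the script above can be tried offline, without sending any requests. The following is a minimal sketch under stated assumptions: the sample HTML, the `/p/<id>` link shape, and the helper names `extract_thread_links` and `save_post` are hypothetical and chosen for illustration; they are not the markup Baidu Tieba actually serves, so the pattern would likely need adjusting against a real page.

```python
import re
import tempfile
from pathlib import Path
from urllib.parse import urljoin

BASE_URL = "https://tieba.baidu.com/"

# Hypothetical markup for illustration only; the real page structure may differ.
SAMPLE_HTML = (
    '<a href="/p/1001">宜昌美食推荐</a>\n'
    '<a href="/p/1002">武汉周末去哪玩</a>\n'
    '<a href="/p/1003">求问宜昌东站怎么走</a>\n'
)

# Capture the href and the link text of each anchor.
# [^"]+ and [^<]* keep every match inside a single <a> tag.
LINK_PATTERN = re.compile(r'<a href="([^"]+)">([^<]*)</a>')


def extract_thread_links(html, keyword):
    """Return absolute URLs of anchors whose link text contains the keyword."""
    return [
        urljoin(BASE_URL, href)
        for href, text in LINK_PATTERN.findall(html)
        if keyword in text
    ]


def save_post(directory, url, content):
    """Write raw bytes to <directory>/<last URL segment>.html and return the path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    # Replace characters that are unsafe in file names (for example "?" or "=").
    name = re.sub(r"[^\w.-]", "_", url.rstrip("/").rsplit("/", 1)[-1])
    path = directory / f"{name}.html"
    path.write_bytes(content)
    return path


links = extract_thread_links(SAMPLE_HTML, "宜昌")
print(links)

# Save a placeholder page into a temporary directory instead of downloading it.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_post(tmp, links[0], "<html>示例</html>".encode("utf-8"))
    print(saved.name)
```

Separating extraction from saving makes it possible to check the regular expression against a saved copy of a page before running the full download loop.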
