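Use regular expressions and the requests library to extract content from https://tieba.baidu.com/p/9123692460, including the poster, the post time, and the thread title, and give concrete implementation code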


Posted: 2024-09-18 18:08:15

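In Python, the `requests` library can fetch the HTML and the `re` (regular expression) module can parse it. The following simple example shows how to extract the poster, the post time, and the thread title from the given page. The patterns are guesses about the page markup, so inspect the actual page source and adjust the tag and class names before relying on them.

```python
import re

import requests

# Send a GET request and fetch the page content
url = "https://tieba.baidu.com/p/9123692460"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
html_content = response.text

# Regex patterns. These assume the poster sits in an author link and that the
# post time and title sit in specific tags; they are placeholders to adapt.
post_author_pattern = r'<a class="author">(.*?)</a>'  # poster
post_time_pattern = r'<span class="ctitle">.*?发布时间：(.*?)</span>'  # post time
post_title_pattern = r'<h2>(.*?)</h2>'  # thread title


def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, text, re.S)
    return match.group(1).strip() if match else None


# group(1) is the text inside the parentheses; group() would return the whole
# match, including the surrounding tags.
post_author = first_group(post_author_pattern, html_content)
post_time = first_group(post_time_pattern, html_content)
post_title = first_group(post_title_pattern, html_content)

print("Poster:", post_author)
print("Post time:", post_time)
print("Thread title:", post_title)
```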
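Since the tag and class names in the patterns are only assumptions, it may help to check the extraction logic offline against a saved HTML snippet before pointing it at the live page. The sketch below does that with a made-up snippet; `SAMPLE_HTML`, `PATTERNS`, `extract_first`, and `extract_post_info` are hypothetical names introduced for illustration, and the markup is not taken from the real site.

```python
import html
import re

# Hypothetical markup for illustration only; the real page's tags and class
# names must be checked in the browser's developer tools.
SAMPLE_HTML = """
<h2>Sample thread title</h2>
<a class="author">sample_user</a>
<span class="ctitle">发布时间：2024-09-18 18:08</span>
"""

# One capture group per field; these mirror the assumed patterns.
PATTERNS = {
    "author": r'<a class="author">(.*?)</a>',
    "time": r'<span class="ctitle">.*?发布时间：(.*?)</span>',
    "title": r"<h2>(.*?)</h2>",
}


def extract_first(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, text, re.S)
    if match is None:
        return None
    # Decode HTML entities such as &amp; and trim surrounding whitespace.
    return html.unescape(match.group(1)).strip()


def extract_post_info(text):
    """Apply every pattern and collect the results in a dict."""
    return {name: extract_first(pattern, text) for name, pattern in PATTERNS.items()}


info = extract_post_info(SAMPLE_HTML)
print(info)
```

A field that comes back as `None` means its pattern did not match, which is usually the signal to look at the page source again and update that pattern rather than the code around it.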
