Collect the news titles from the following page: https://www.solidot.org/ (Method 1: urllib + regex)

Date: 2024-12-17 16:20:21

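The question asks for Python's `urllib` library and regular expressions (`re`) to scrape the news titles from Solidot.org. The steps below take a different route: they fetch the page with the `requests` library, then parse the HTML with BeautifulSoup and extract the titles.

1. Install the required libraries (if not already installed):

```bash
pip install requests beautifulsoup4
```

2. Write the Python code:

```python
import requests
from bs4 import BeautifulSoup

# Request URL
url = "https://www.solidot.org/"
response = requests.get(url, timeout=10)

# Check whether the request succeeded
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.text, "html.parser")

    # Find the <a> tags that usually hold the news titles.
    # The class name 'storytitle' is not verified against the live page;
    # inspect the page source and adjust the selector if nothing matches.
    titles = soup.find_all("a", class_="storytitle")

    # get_text() already drops nested tags, so no regex cleanup is needed
    news_titles = [title.get_text(strip=True) for title in titles]

    # Print (or save) the list of titles
    for title in news_titles:
        print(title)
else:
    print(f"Could not access the page, status code: {response.status_code}")
```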
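Since the question specifically asks for `urllib` plus regular expressions, the same task can also be sketched with the standard library alone. This is a minimal sketch, not a verified scraper: the pattern assumes each headline is a link of the form `<a href="/story?sid=NNN">Title</a>`, which should be checked against the actual page source. The names `fetch_html`, `extract_titles`, and `print_titles` are hypothetical helpers introduced for illustration.

```python
import html
import re
import urllib.request

URL = "https://www.solidot.org/"

# Assumed markup: every headline is a link such as
# <a href="/story?sid=12345">Title</a>. Verify this against the page source.
TITLE_PATTERN = re.compile(
    r'<a\s+href="/story\?sid=\d+"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
TAG_PATTERN = re.compile(r"<[^>]+>")


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page with urllib and decode it as UTF-8."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_titles(page: str) -> list[str]:
    """Return the headline texts matched by TITLE_PATTERN, without duplicates."""
    titles = []
    for raw in TITLE_PATTERN.findall(page):
        # Strip nested tags, decode entities such as &amp;, trim whitespace
        text = html.unescape(TAG_PATTERN.sub("", raw)).strip()
        if text and text not in titles:
            titles.append(text)
    return titles


def print_titles(url: str = URL) -> None:
    """Fetch the page (network access required) and print each headline."""
    for title in extract_titles(fetch_html(url)):
        print(title)


# Offline check on a hand-written sample, so no network access is needed
sample = """
<h2><a href="/story?sid=1">First <b>story</b></a></h2>
<h2><a href="/story?sid=2">Tom &amp; Jerry</a></h2>
<a href="/about">About</a>
"""
sample_titles = extract_titles(sample)
print(sample_titles)
```

Calling `print_titles()` runs the sketch against the live site. Regular expressions are brittle on HTML, so if the site's markup differs from the assumed link format, only `TITLE_PATTERN` should need changing.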
