Scrape the news data from the page https://www.xuexi.cn/f997e76a890b0e5a053c57b19f468436/018d244441062d8916dd472a4c6a0a0b.html.


Time: 2023-10-31 10:59:18

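Sure. This task can be done with Python's requests and BeautifulSoup libraries. Here is an implementation:

```python
import requests
from bs4 import BeautifulSoup

url = 'https://www.xuexi.cn/f997e76a890b0e5a053c57b19f468436/018d244441062d8916dd472a4c6a0a0b.html'
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
response.encoding = response.apparent_encoding  # guess the charset so Chinese text is not garbled

soup = BeautifulSoup(response.text, 'html.parser')
news_list = soup.find_all('div', class_='text-wrap')

for news in news_list:
    title_tag = news.find('h1')
    time_tag = news.find('span', class_='time')
    # find() returns None when a tag is missing, so check before reading .text
    title = title_tag.text.strip() if title_tag else ''  # news title
    published = time_tag.text.strip() if time_tag else ''  # news time
    content = ''.join(p.text.strip() for p in news.find_all('p'))  # news content
    print(title)
    print(published)
    print(content)
    print('------------------------------------')
```

The code first fetches the page's HTML with requests and then parses it with BeautifulSoup. Next, it uses `find_all` to locate every `<div>` element that holds a news item and extracts the title, time, and content from each one. Finally, it prints that information.

Note that this page is loaded dynamically, so the HTML returned by a plain requests call may not contain the complete news data, and `news_list` may be empty. In that case, a tool such as Selenium can simulate a browser to obtain the fully rendered page.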
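The Selenium fallback mentioned above could look roughly like the sketch below. It is only an illustration under several assumptions: the names `has_news_markup` and `fetch_rendered_html` are hypothetical helpers, the `text-wrap` class is taken from the answer's code and has not been checked against the live page, and a local Chrome installation that Selenium 4 can drive is assumed.

```python
from html.parser import HTMLParser


class _NewsMarkupFinder(HTMLParser):
    """Sets `found` once a <div> carrying the expected class is seen."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # The class attribute may be absent or valueless, hence the `or ""`.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and self.css_class in classes:
            self.found = True


def has_news_markup(html, css_class="text-wrap"):
    """Return True if the HTML already contains a news <div> (assumed class name)."""
    finder = _NewsMarkupFinder(css_class)
    finder.feed(html)
    return finder.found


def fetch_rendered_html(url, css_class="text-wrap", timeout=15):
    """Load the page in headless Chrome and return the HTML after scripts have run."""
    # Imported here so has_news_markup() stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until at least one news <div> exists; raises TimeoutException otherwise.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, f"div.{css_class}"))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if the wait fails
```

One way to use it: call `has_news_markup(response.text)` on the requests result first, and only when it returns `False` call `fetch_rendered_html(url)` and pass the returned string to `BeautifulSoup(..., 'html.parser')` as before. Starting a browser is much slower than a plain HTTP request, so keeping it as a fallback seems a reasonable trade-off.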