Scraping data from different pages of https://www.bernama.com/en/ and saving each page's data to its own TXT file


Date: 2023-12-08 13:03:27 Views: 61


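You can use Python's requests and BeautifulSoup libraries to scrape the different pages and save the data from each page to TXT files. Here is an example:

```python
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Number of pages to scrape
num_pages = 5

# Open each page and fetch its HTML content
for page_num in range(1, num_pages + 1):
    url = f"https://www.bernama.com/en/archive.php?page={page_num}"
    response = requests.get(url, timeout=10)
    html_content = response.text

    # Parse the HTML and extract each news item's title and content
    soup = BeautifulSoup(html_content, "html.parser")
    news_list = soup.find_all("div", class_="row news-row")
    for news in news_list:
        anchor = news.find("a")
        if anchor is None or not anchor.get("href"):
            continue  # skip rows that carry no link
        title = anchor.text.strip()
        # Resolve relative links against the listing page URL
        link = urljoin(url, anchor["href"])
        news_response = requests.get(link, timeout=10)
        news_soup = BeautifulSoup(news_response.text, "html.parser")
        content_list = news_soup.find_all("div", class_="col-md-12 news-body")
        if not content_list:
            continue  # skip articles whose body block is missing
        content = "\n".join(p.text for p in content_list[0].find_all("p"))

        # Replace characters that are invalid in file names, then save as UTF-8
        safe_title = re.sub(r'[\\/:*?"<>|]', "_", title)[:80]
        with open(f"page{page_num}_{safe_title}.txt", "w", encoding="utf-8") as f:
            f.write(f"Title: {title}\n")
            f.write(f"Link: {link}\n")
            f.write(f"Content:\n{content}\n")
```

This example sets the number of pages to scrape to 5, loops over each listing page, and saves every news item found on that page to its own TXT file. Each file name carries a `page{page_num}` prefix so that files from different pages can be told apart. The title is sanitized before it is used as a file name, because characters such as `/` or `:` are not allowed in file names on common systems. The archive URL pattern and the CSS class names are assumptions about the site's markup; check them against the live pages and adjust as needed.

Note that scraping a website may raise legal and ethical issues. Comply with local laws and regulations, and respect the site's privacy policy and terms of use.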
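The example above writes one file per news item. If a single TXT file per listing page is preferred, the scraped items can be collected first and written together. The following is a minimal sketch of that idea, not a definitive implementation: the helper names `safe_filename` and `save_page`, the `output` directory, and the dict keys `title`, `link`, and `content` are all hypothetical and chosen only for illustration.

```python
import re
from pathlib import Path


def safe_filename(name, max_len=80):
    """Replace characters that are invalid in file names and cap the length."""
    # Collapse runs of reserved characters and whitespace into one underscore
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", name).strip("_")
    return cleaned[:max_len] or "untitled"


def save_page(page_num, articles, out_dir="output"):
    """Write all articles scraped from one listing page into a single TXT file.

    `articles` is assumed to be a list of dicts with the (hypothetical)
    keys "title", "link" and "content".
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    target = out_path / f"page{page_num}.txt"
    with open(target, "w", encoding="utf-8") as f:
        for article in articles:
            f.write(f"Title: {article['title']}\n")
            f.write(f"Link: {article['link']}\n")
            f.write(f"Content:\n{article['content']}\n")
            f.write("-" * 40 + "\n")  # separator between articles
    return target
```

Inside the scraping loop, each item would be appended to a per-page list and `save_page(page_num, articles)` called once after the inner loop finishes. `safe_filename` is only needed when a title is used as part of a file name, as in the per-item variant.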
