Crawl the website https://www.bernama.com/en/ and download the data from each page as txt

Date: 2023-08-15 15:07:59 Views: 95

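Crawl all pages of a website

This example crawler uses the requests and BeautifulSoup libraries to crawl every page of a given website, and the pymongo library to save each page's title, URL, and body text to a MongoDB database. It also uses the urljoin function to resolve relative links.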
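The link-handling part of such a crawler can be sketched as follows. This is a minimal illustration, not the code of the resource described above: it uses only the Python standard library (`html.parser` instead of BeautifulSoup) so that it runs without extra packages, and the names `LinkParser`, `extract_links`, and `crawl`, as well as the caller-supplied `fetch` function, are hypothetical. It assumes that "all pages" means pages on the same host as the start URL.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, page_url):
    """Return absolute, same-host http(s) links found in html, without fragments."""
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in parser.hrefs:
        # urljoin resolves relative links against the page they appear on
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            if absolute not in links:
                links.append(absolute)
    return links


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl; fetch(url) must return the page's HTML as a string."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

With requests installed, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`. The `max_pages` cap is there so that a run against a large site stops after a bounded number of requests.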

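Sorry, as an AI language model I cannot run a web crawler directly. However, you can use Python's requests and BeautifulSoup libraries to crawl the site's data and save it as txt files. Below is a simple example that fetches the news titles and content from https://www.bernama.com/en/ and saves each article to a txt file. The CSS class names in the selectors are taken as given and may need adjusting to the site's actual markup.

```python
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.bernama.com/en/"

# Fetch the front page and parse its HTML
response = requests.get(BASE_URL, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# The class names below are assumptions about the site's markup;
# adjust them if the page structure differs
news_list = soup.find_all("div", class_="row news-row")
for news in news_list:
    anchor = news.find("a", href=True)
    if anchor is None:
        continue
    title = anchor.text.strip()
    # Resolve relative links against the front-page URL
    link = urljoin(BASE_URL, anchor["href"])

    # Fetch the article page and collect its paragraphs
    news_response = requests.get(link, timeout=10)
    news_soup = BeautifulSoup(news_response.text, "html.parser")
    body = news_soup.find("div", class_="col-md-12 news-body")
    if body is None:
        continue
    content = "\n".join(p.text for p in body.find_all("p"))

    # Replace characters that are invalid in file names
    filename = re.sub(r'[\\/:*?"<>|]+', "_", title) or "untitled"
    with open(f"{filename}.txt", "w", encoding="utf-8") as f:
        f.write(f"Title: {title}\n")
        f.write(f"Link: {link}\n")
        f.write(f"Content:\n{content}\n")
```

Note that crawling a website's data may raise legal and ethical issues, so comply with local laws, regulations, and ethical guidelines. Also respect the site's privacy policy and terms of use.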
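The example above names each file after the article headline, so two articles with the same headline would overwrite each other. One possible alternative, sketched here under assumptions, is to derive the file name from the page URL instead. The helper names `url_to_filename` and `save_page` and the output directory argument are hypothetical, and the sketch uses only the standard library.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def url_to_filename(url):
    """Build a file-system-safe .txt name from a URL's host, path, and query."""
    parts = urlparse(url)
    raw = parts.netloc + parts.path
    if parts.query:
        raw += "_" + parts.query
    # Collapse every run of characters outside [A-Za-z0-9._-] into one underscore
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", raw).strip("_")
    return (safe or "index") + ".txt"


def save_page(out_dir, url, title, content):
    """Write one page to out_dir as UTF-8 text and return the file path."""
    out_path = Path(out_dir) / url_to_filename(url)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(
        f"Title: {title}\nLink: {url}\nContent:\n{content}\n",
        encoding="utf-8",
    )
    return out_path
```

Writing with an explicit `encoding="utf-8"` avoids depending on the platform's default encoding, which matters when the article text contains non-ASCII characters.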
