
Scraping and printing poems from gushiwen.org (古诗文网) in PyCharm

Time: 2024-12-22 09:32:45 Views: 36

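To scrape poems from gushiwen.org (古诗文网) in PyCharm, combine the steps mentioned earlier. Here is a simplified example that assumes we want the title and content of each poem. First, make sure the `requests` and `beautifulsoup4` libraries are installed; if they are not, install them with pip:

```bash
pip install requests beautifulsoup4
```

Then create a new Python project in PyCharm and write the following code:

```python
import requests
from bs4 import BeautifulSoup


def get_poem_content(url):
    """Return (title, content) for the first poem on the page, or None on failure."""
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code != 200:
        print(f"Error fetching URL: {url}")
        return None
    # html.parser ships with Python; 'lxml' would need a separate install
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.select_one('.title')      # assumes the title is in an element with class "title"
    content = soup.select_one('.content')  # assumes the content is in an element with class "content"
    if title is None or content is None:
        print(f"Expected elements not found: {url}")
        return None
    return title.text.strip(), content.text.strip()


base_url = "https://so.gushiwen.org/shiwen/"

for page_number in range(1, 6):  # assumes 5 pages; adjust to the number you actually need
    page_url = f"{base_url}?page={page_number}"
    result = get_poem_content(page_url)
    if result is None:
        continue  # skip pages that failed to load or parse
    title, content = result
    print(f"Page {page_number} poem title: {title}")
    print(f"Page {page_number} poem content: {content}\n")
    # Save to a file or a database; here each page gets its own text file
    with open(f'poem_{page_number}.txt', 'w', encoding='utf-8') as file:
        file.write(f"Title: {title}\nContent: {content}\n\n")
```

This code requests each page URL, extracts the first poem on that page, prints it, and saves it to a text file. The `.title` and `.content` selectors and the `?page=` parameter are assumptions, so remember to adjust them, along with `base_url` and the page range, to match the actual layout of gushiwen.org.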
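The URL construction and the file output can be separated from the network code so that they are easy to check offline. The following is a minimal sketch using only the standard library. The helper names `build_page_url` and `save_poems` are hypothetical, and the `page` query parameter is carried over from the example above as an assumption about the site, not a confirmed interface.

```python
from pathlib import Path
from urllib.parse import urlencode


def build_page_url(base_url, page):
    """Return the listing URL for one page.

    The `page` query parameter is an assumption taken from the example
    above; the real site may paginate differently.
    """
    return f"{base_url}?{urlencode({'page': page})}"


def save_poems(poems, path):
    """Write (title, content) pairs to one UTF-8 text file.

    Returns the number of poems written.
    """
    records = [f"Title: {title}\nContent: {content}\n" for title, content in poems]
    Path(path).write_text("\n".join(records), encoding="utf-8")
    return len(records)


if __name__ == "__main__":
    # Build the URLs for pages 1 to 3 without fetching anything
    for n in range(1, 4):
        print(build_page_url("https://so.gushiwen.org/shiwen/", n))
```

Collecting the `(title, content)` pairs in a list and calling `save_poems` once puts all results in a single file, which may be easier to read than one file per page.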
