
Scraping 古诗文网 (Gushiwen) with a Python Crawler

Time: 2024-10-16 07:07:24

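Python crawlers are commonly used to fetch web page content. To collect data from a site such as 古诗文网, you can follow these steps:

1. **Install the required libraries**: Install `requests` to send HTTP requests and `BeautifulSoup` (the `beautifulsoup4` package) to parse HTML. `Scrapy` is an alternative: a full crawling framework that handles both tasks.

```bash
pip install requests beautifulsoup4
```

2. **Send the request**: Use `requests.get()` to send a GET request to the page URL on 古诗文网 and read the response.

```python
import requests

url = "https://so.gushiwen.org/shiwen"  # poem category page on 古诗文网
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
```

3. **Parse the HTML**: Use BeautifulSoup to parse the returned HTML and pull out the data you need. This usually means finding the tags that hold the poem information, such as `<div>` or `<p>`.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
# Example selector only: find the div elements that hold poem content.
# Check the actual class name in the page source.
poems = soup.find_all('div', class_='content')
```

4. **Extract and process the data**: Loop over the matched elements and extract the title, author, and body of each poem. Use the `.text` attribute (or `get_text()`) to get the text, then store it in a list, a dictionary, or another data structure as needed.

5. **Save the data**: Write the extracted information to a file such as CSV or JSON, or insert it directly into a database. The example below writes plain text.

```python
with open('poems.txt', 'w', encoding='utf-8') as f:
    for poem in poems:
        # Look up each tag once and guard against missing elements
        title_tag = poem.find('h3')
        author_tag = poem.find('span', class_='author')
        title = title_tag.get_text(strip=True) if title_tag else ''
        author = author_tag.get_text(strip=True) if author_tag else ''
        # All text in the block, one fragment per line, so verses
        # separated by <br> end up on separate lines
        content = poem.get_text('\n', strip=True)
        f.write(f"{title}\n{author}\n{content}\n\n")
```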
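Step 5 mentions CSV and JSON as output formats, while the snippet above writes plain text. The following is a minimal sketch of both, using only the standard library. The `records` list, the field names `title`/`author`/`content`, the helper names `save_json` and `save_csv`, and the output file names are all assumptions made for illustration; in practice the records would come from the extraction loop in step 4.

```python
import csv
import json

# Hypothetical records, shaped like what the extraction step might produce.
records = [
    {"title": "静夜思", "author": "李白", "content": "床前明月光，疑是地上霜。"},
    {"title": "春晓", "author": "孟浩然", "content": "春眠不觉晓，处处闻啼鸟。"},
]


def save_json(rows, path):
    """Write all records as one JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese characters readable in the file
        json.dump(rows, f, ensure_ascii=False, indent=2)


def save_csv(rows, path):
    """Write records as CSV with a header row."""
    # utf-8-sig prepends a BOM, which helps spreadsheet tools detect UTF-8;
    # newline="" is what the csv module expects for file objects.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "author", "content"])
        writer.writeheader()
        writer.writerows(rows)


save_json(records, "poems.json")
save_csv(records, "poems.csv")
```

JSON keeps nested structure (for example a list of lines per poem) if you need it later, whereas CSV is easier to open in a spreadsheet but is limited to flat columns.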
