Implement the following project in Python: scrape the 税屋 (shui5) site at https://www.shui5.cn/article/BeiJingShiCaiShuiFaGui/. Requirements: crawl 10 pages of data and write them to a txt file. Data content: 1) title, 2) date. File format: [date] title
Posted: 2023-12-20 22:07:30
Sure. The scraper uses the requests, BeautifulSoup, and re libraries. Here are the implementation steps:
1. Import the required libraries:
```python
import requests
from bs4 import BeautifulSoup
import re
```
2. Define the target URL and the number of pages to crawl:
```python
url = 'https://www.shui5.cn/article/BeiJingShiCaiShuiFaGui/'
pages = 10
```
3. Define a function that scrapes one listing page:
```python
def get_data(url):
    """Fetch one listing page and append '[date] title' lines to data.txt."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/58.0.3029.110 Safari/537.36'}
    res = requests.get(url, headers=headers, timeout=10)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    # Article links on the listing page point at /article/... paths
    articles = soup.find_all('a', href=re.compile('/article/'))
    with open('data.txt', 'a', encoding='utf-8') as f:
        for article in articles:
            title = article.get_text(strip=True)
            # The date may not sit inside the <a> tag itself, so search the
            # parent element too; skip entries where no date is found
            dates = re.findall(r'\d{4}-\d{2}-\d{2}', str(article.parent))
            if not title or not dates:
                continue
            f.write('[{}] {}\n'.format(dates[0], title))
```
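The date pattern and output formatting can be checked offline. The HTML fragment below is hypothetical and only stands in for one listing entry; the real markup on shui5.cn may differ:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical listing entry, for illustration only
html = ('<li><a href="/article/demo.html">Sample tax notice</a>'
        ' <span>2023-12-20</span></li>')
soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a', href=re.compile('/article/'))
title = a.get_text(strip=True)
# The date lives in a sibling element, so search the parent's HTML
date = re.findall(r'\d{4}-\d{2}-\d{2}', str(a.parent))[0]
line = '[{}] {}'.format(date, title)
print(line)  # [2023-12-20] Sample tax notice
```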
4. Loop over the pages and scrape each one:
```python
for page in range(1, pages + 1):
    # Note: on many sites the first listing page is index.html rather than
    # index_1.html; adjust the pattern if page 1 returns a 404
    page_url = url + 'index_{}.html'.format(page)
    get_data(page_url)
```
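For reference, the loop above generates the following page URLs (a quick offline check, assuming the `index_{n}.html` pattern holds for every page):

```python
url = 'https://www.shui5.cn/article/BeiJingShiCaiShuiFaGui/'
pages = 10

# Same URLs the crawl loop would request, built without any network I/O
page_urls = [url + 'index_{}.html'.format(page) for page in range(1, pages + 1)]
print(page_urls[0])    # ends with index_1.html
print(len(page_urls))  # 10
```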
5. Run the code; the results will be written to data.txt.
The complete code is as follows:
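The snippets above can be reassembled into one script. This is a sketch: the HTML parsing is split into a standalone `parse_articles()` helper so it can be exercised without hitting the network, and the `/article/` link selector and `index_{n}.html` pagination pattern are assumptions about the site's structure:

```python
import re

import requests
from bs4 import BeautifulSoup

URL = 'https://www.shui5.cn/article/BeiJingShiCaiShuiFaGui/'
PAGES = 10
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/58.0.3029.110 Safari/537.36'}


def parse_articles(html):
    """Extract (date, title) pairs from one listing page's HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for a in soup.find_all('a', href=re.compile('/article/')):
        title = a.get_text(strip=True)
        # The date usually sits near the link, not inside it
        dates = re.findall(r'\d{4}-\d{2}-\d{2}', str(a.parent))
        if title and dates:
            results.append((dates[0], title))
    return results


def get_data(page_url, out):
    """Fetch one listing page and write its entries to the open file."""
    res = requests.get(page_url, headers=HEADERS, timeout=10)
    res.encoding = 'utf-8'
    for date, title in parse_articles(res.text):
        out.write('[{}] {}\n'.format(date, title))


def main():
    with open('data.txt', 'w', encoding='utf-8') as out:
        for page in range(1, PAGES + 1):
            get_data(URL + 'index_{}.html'.format(page), out)


if __name__ == '__main__':
    main()
```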