Write a web crawler in Python. Requirements: 1. Target site: https://news.pdsu.edu.cn/xxyw.htm 2. Read the page numbers from the pagination at the bottom 3. For every page of the "School News" list, collect each article's title, link, and publication date 4. Save the titles, links, and dates to an Excel file
Posted: 2024-12-17 19:44:10
In Python we can use the `requests`, `BeautifulSoup`, and `pandas` libraries for this task. Below is a basic example. First install any missing libraries with pip:
```bash
pip install beautifulsoup4 requests pandas openpyxl
```
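As a quick illustration of how BeautifulSoup pulls out the three fields we need, here is a minimal sketch that parses an inline HTML fragment shaped like the list markup assumed below (the `list-item`/`time` class names are assumptions, not the site's verified structure):

```python
from bs4 import BeautifulSoup

# A fragment shaped like one assumed news-list entry
html = '''
<div class="list-item">
  <h2><a href="info/1088/1.htm">Sample headline</a></h2>
  <span class="time">2024-12-17</span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')  # html.parser needs no extra install
item = soup.find('div', class_='list-item')
print(item.h2.a.text)                          # Sample headline
print(item.h2.a['href'])                       # info/1088/1.htm
print(item.find('span', class_='time').text)   # 2024-12-17
```

Note that the extracted `href` is relative; it must be joined with the page URL before it is useful outside the site.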
Then the code can be written as follows:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urljoin

def get_news_pages(url):
    # Collect the numeric page links from the pagination bar at the bottom.
    # The class names used here must match the actual page markup.
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')
    pagination = soup.find('div', class_='pagination')
    return [int(a.text) for a in pagination.find_all('a') if a.text.isdigit()]

def parse_news_page(url, page_num):
    # Fetch one page of the news list; 'p' is assumed to be the paging parameter
    response = requests.get(url, params={'p': str(page_num)})
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')
    news_items = []
    for item in soup.find_all('div', class_='list-item'):
        title = item.h2.a.text.strip()
        link = urljoin(url, item.h2.a['href'])  # resolve relative hrefs
        date = item.find('span', class_='time').text.strip()
        news_items.append({'title': title, 'link': link, 'date': date})
    return news_items

def save_to_excel(data, filename):
    # pandas writes through openpyxl; the context manager saves the file on exit
    df = pd.DataFrame(data)
    with pd.ExcelWriter(filename, engine='openpyxl') as writer:
        df.to_excel(writer, sheet_name='News', index=False)

url = 'https://news.pdsu.edu.cn/xxyw.htm'
pages = get_news_pages(url)
for page in pages:
    news_data = parse_news_page(url, page)
    save_to_excel(news_data, f'school_news_{page}.xlsx')
```
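The loop above writes a separate workbook per page. Since the task asks for a single Excel file, a reasonable variation is to accumulate every page first and write once at the end. A sketch, using dummy rows in place of real `parse_news_page` output:

```python
# Dummy per-page results standing in for parse_news_page(url, page)
page_results = [
    [{'title': 'News A', 'link': 'https://news.pdsu.edu.cn/info/1.htm', 'date': '2024-12-01'}],
    [{'title': 'News B', 'link': 'https://news.pdsu.edu.cn/info/2.htm', 'date': '2024-12-02'}],
]

all_news = []
for items in page_results:
    all_news.extend(items)  # flatten the per-page lists into one list of rows

# With real data, a single write then replaces the per-page files:
# pd.DataFrame(all_news).to_excel('school_news.xlsx', index=False)
print(len(all_news))  # 2
```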