Write a Python crawler script that scrapes news daily from https://www.reuters.com/ and https://www.ap.org/en/
Posted: 2023-11-10 10:05:48 · Views: 49
Sure. We can write the crawler with Python's requests and BeautifulSoup libraries. Example code below; note that both sites render much of their content with JavaScript, so a plain HTTP fetch may return fewer headlines than a browser shows:
```python
import datetime

import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent; some sites reject the default requests UA.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/58.0.3029.110 Safari/537.3'
}

# Reuters news
reuters_url = 'https://www.reuters.com/'
reuters_response = requests.get(reuters_url, headers=HEADERS, timeout=10)
reuters_response.raise_for_status()
reuters_soup = BeautifulSoup(reuters_response.text, 'html.parser')

# AP news
ap_url = 'https://www.ap.org/en/'
ap_response = requests.get(ap_url, headers=HEADERS, timeout=10)
ap_response.raise_for_status()
ap_soup = BeautifulSoup(ap_response.text, 'html.parser')

# Save the headlines to date-stamped files
date = datetime.datetime.now().strftime('%Y-%m-%d')

with open(f'reuters_news_{date}.txt', 'w', encoding='utf-8') as f:
    for article in reuters_soup.find_all('article'):
        # Not every <article> has a heading; guard against AttributeError.
        heading = article.find(['h1', 'h2', 'h3'])
        if heading:
            f.write(heading.get_text(strip=True) + '\n')

with open(f'ap_news_{date}.txt', 'w', encoding='utf-8') as f:
    for article in ap_soup.find_all('article'):
        heading = article.find(['h1', 'h2', 'h3'])
        if heading:
            f.write(heading.get_text(strip=True) + '\n')
```
This script fetches today's headlines from Reuters and the Associated Press and saves them to files named `reuters_news_<date>.txt` and `ap_news_<date>.txt`, where `<date>` is the current date. Both sites change their markup periodically and may throttle automated clients, so adjust the tag selectors if the output files come back empty. To run it every day, add the script to a scheduler such as cron (Linux/macOS) or Task Scheduler (Windows).
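If an external scheduler is not available, the daily run can also be driven from inside Python. The sketch below is illustrative: `daily_filename` and `run_daily` are hypothetical helper names, and the blocking-loop approach is a minimal stand-in for a real scheduler, not a production recommendation.

```python
import datetime
import time


def daily_filename(prefix: str, when: datetime.datetime) -> str:
    """Build a date-stamped output filename like the script above uses."""
    return f"{prefix}_news_{when.strftime('%Y-%m-%d')}.txt"


def run_daily(job, hour: int = 8, minute: int = 0) -> None:
    """Call `job()` once per day at the given local time.

    Sleeps until the next occurrence of hour:minute, runs the job,
    then repeats. A crash in `job` is caught so one bad day does not
    stop the loop.
    """
    while True:
        now = datetime.datetime.now()
        target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
        if target <= now:
            target += datetime.timedelta(days=1)  # already past today's slot
        time.sleep((target - now).total_seconds())
        try:
            job()
        except Exception as exc:
            print(f'daily job failed: {exc}')
```

Usage would be `run_daily(scrape_news)` where `scrape_news` wraps the fetching code above in a function; a cron entry such as `0 8 * * * /usr/bin/python3 /path/to/scraper.py` achieves the same without keeping a process alive.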