Write a crawler program that downloads 200 English-only news articles and saves them to a txt file

Date: 2024-06-09 09:04:44 Views: 65

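Below is a crawler written in Python with the BeautifulSoup library. It collects article links from the CNN homepage, downloads each article, and appends the title and body to a txt file. The class names `pg-headline` and `zn-body__paragraph` depend on CNN's page markup, so they may need to be updated if the site layout changes.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Number of articles to download
num_articles = 200

# Request the CNN homepage and parse the HTML
url = 'https://www.cnn.com/'
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# Collect article links from the homepage,
# resolving relative URLs and skipping duplicates
articles = []
for link in soup.find_all('a'):
    href = link.get('href')
    if href is not None and '/article/' in href:
        full_url = urljoin(url, href)
        if full_url not in articles:
            articles.append(full_url)

# Download each article and append it to the txt file
count = 0
for article_url in articles:
    if count >= num_articles:
        break
    response = requests.get(article_url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    title_tag = soup.find('h1', {'class': 'pg-headline'})
    paragraphs = soup.find_all('div', {'class': 'zn-body__paragraph'})
    # Skip pages that do not match the expected layout
    if title_tag is None or not paragraphs:
        continue
    title = title_tag.text.strip()
    content = '\n'.join(p.text.strip() for p in paragraphs)
    with open('articles.txt', 'a', encoding='utf-8') as f:
        f.write(title + '\n')
        f.write(content + '\n\n')
    count += 1
```

Before running the program, install the following libraries:

- requests
- BeautifulSoup

They can be installed with:

```bash
pip install requests beautifulsoup4
```

After the program runs, a file named `articles.txt` is created in the same directory, containing the title and body of each downloaded article, up to 200. The homepage alone may link to fewer than 200 articles, in which case fewer are saved.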
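The question asks for English-only news, but the program does not check the language of what it saves. The following is a minimal sketch of one possible filter, assuming that "English-only" can be approximated by the share of ASCII characters in the text. The helper name `is_mostly_english` and the 0.95 threshold are hypothetical choices for illustration, not part of the original program.

```python
def is_mostly_english(text: str, threshold: float = 0.95) -> bool:
    """Heuristic check: True if most non-whitespace characters are ASCII.

    Hypothetical helper; the threshold is an illustrative assumption.
    """
    # Ignore whitespace so that formatting does not affect the ratio
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    ascii_count = sum(1 for c in chars if c.isascii())
    return ascii_count / len(chars) >= threshold
```

In the download loop, an article could be skipped when `is_mostly_english(content)` returns `False`. This heuristic only separates ASCII-dominant text from text in other scripts; it cannot tell English apart from other languages written in the Latin alphabet, so a language-detection library would be needed for a stricter check.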
