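Write a Python program that crawls all the news on www.sdtbu.edu.cn. Each news item gets its own subfolder; the text of the page is stored in a text file, and the images on the page are saved separately in that same folder.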

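Sure. This task calls for a crawler. We can use Python's requests library to fetch the pages and BeautifulSoup to parse the HTML. First, import the libraries we need:

```python
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
```

Then define a function that fetches a page and parses its HTML with BeautifulSoup:

```python
def get_html(url):
    res = requests.get(url, timeout=10)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    return soup
```

Next, find the news links and loop over them. `find_all` returns every matching link. The class name `news-title` here is a placeholder, so inspect the real page and adjust the selector to match its markup. Note also that this only picks up links that appear on the page at `url`; to cover every news item, run the same loop over each news list page.

```python
url = 'http://www.sdtbu.edu.cn/'
soup = get_html(url)
news_list = soup.find_all('a', class_='news-title', href=True)
for news in news_list:
    news_url = urljoin(url, news['href'])  # hrefs are often relative
    news_title = news.get_text(strip=True)
    print(news_url, news_title)
```

Now that we have the link and title of each news item, we create a local subfolder for it and save the page text and images there. `os.makedirs` with `exist_ok=True` creates the folder without failing on a rerun, and `requests.get` downloads each image. The `content` class is likewise a placeholder for whatever element holds the article body.

```python
for news in news_list:
    news_url = urljoin(url, news['href'])
    news_title = news.get_text(strip=True)
    # Create the subfolder
    folder_name = news_title.replace('/', '-')
    os.makedirs(folder_name, exist_ok=True)
    # Fetch the text and images of the news page
    news_soup = get_html(news_url)
    content_div = news_soup.find('div', class_='content')
    content = content_div.get_text() if content_div else ''
    with open(os.path.join(folder_name, 'content.txt'), 'w', encoding='utf-8') as f:
        f.write(content)
    img_list = news_soup.find_all('img', src=True)
    for i, img in enumerate(img_list):
        img_url = urljoin(news_url, img['src'])  # resolve relative image paths
        img_data = requests.get(img_url, timeout=10).content
        with open(os.path.join(folder_name, 'img{}.jpg'.format(i)), 'wb') as f:
            f.write(img_data)
```

The complete code is as follows:

```python
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def get_html(url):
    res = requests.get(url, timeout=10)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    return soup


url = 'http://www.sdtbu.edu.cn/'
soup = get_html(url)
# 'news-title' is a placeholder class name; adjust it to the real page
news_list = soup.find_all('a', class_='news-title', href=True)
for news in news_list:
    news_url = urljoin(url, news['href'])
    news_title = news.get_text(strip=True)
    # Create the subfolder
    folder_name = news_title.replace('/', '-')
    os.makedirs(folder_name, exist_ok=True)
    # Fetch the text and images of the news page
    news_soup = get_html(news_url)
    content_div = news_soup.find('div', class_='content')
    content = content_div.get_text() if content_div else ''
    with open(os.path.join(folder_name, 'content.txt'), 'w', encoding='utf-8') as f:
        f.write(content)
    img_list = news_soup.find_all('img', src=True)
    for i, img in enumerate(img_list):
        img_url = urljoin(news_url, img['src'])
        img_data = requests.get(img_url, timeout=10).content
        with open(os.path.join(folder_name, 'img{}.jpg'.format(i)), 'wb') as f:
            f.write(img_data)
```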
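The code above names each folder directly after the news title and saves every image with a `.jpg` extension. Titles can contain characters that are not valid in folder names, and images are not always JPEGs, so a minimal sketch of two helpers may be useful. The names `safe_folder_name` and `image_filename`, the fallback name, and the list of accepted extensions are assumptions made for illustration, not part of the original answer.

```python
import os
import re
from urllib.parse import urlparse

# Characters rejected in Windows file names, plus whitespace control characters.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\r\n\t]')


def safe_folder_name(title, fallback='untitled', max_len=100):
    """Turn a news title into a string that is safe to use as a folder name."""
    name = _INVALID_CHARS.sub('-', title)[:max_len].strip(' .')
    # An empty result (blank title, or only dots/spaces) falls back to a default.
    return name or fallback


def image_filename(img_url, index, default_ext='.jpg'):
    """Build 'img<index><ext>', taking the extension from the URL path if it is a known image type."""
    ext = os.path.splitext(urlparse(img_url).path)[1].lower()
    if ext not in {'.jpg', '.jpeg', '.png', '.gif', '.bmp', '.webp'}:
        ext = default_ext
    return 'img{}{}'.format(index, ext)
```

In the loop, `safe_folder_name(news_title)` could stand in for `news_title.replace('/', '-')`, and `image_filename(img_url, i)` for `'img{}.jpg'.format(i)`. Two different titles can still map to the same folder name, so appending an index may be worth considering if that happens on the real site.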
