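Scrape the titles and corresponding hyperlinks of the first few news items on the SISU (上外) homepage http://www.shisu.edu.cn/ and save them to a txt file, one "title hyperlink" pair per line. Suggestion: use modular programming, split into several functions: 1) fetch the HTML page content: gethtmltxt(url); 2) extract and filter the data: getContent(html), which stores the scraped data in a dictionary and returns it; 3) save the data: saveData(dic), which writes the dictionary's contents to a file.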


Time: 2024-02-29 13:57:23 Views: 78

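Sure, here is an implementation that follows your requirements:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = 'http://www.shisu.edu.cn/'


def gethtmltxt(url):
    """Fetch the page and return its HTML text, or '' if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding  # guess the encoding from the content
        return r.text
    except requests.RequestException:
        return ""


def getContent(html):
    """Extract the news items into a {title: link} dictionary."""
    newsDict = {}
    soup = BeautifulSoup(html, 'html.parser')
    # Assumption: the news list sits in the first <div class="article">.
    # Adjust this selector if the live page uses different markup.
    container = soup.find('div', class_='article')
    if container is None:
        return newsDict
    for news in container.find_all('li'):
        a = news.find('a', href=True)
        if a is None:
            continue  # skip list items without a link
        title = a.get_text(strip=True)
        # urljoin handles both relative and absolute href values
        newsDict[title] = urljoin(BASE_URL, a['href'])
    return newsDict


def saveData(dic):
    """Write one 'title link' pair per line to news.txt."""
    with open('news.txt', 'w', encoding='utf-8') as f:
        for key, value in dic.items():
            f.write(key + ' ' + value + '\n')


def main():
    html = gethtmltxt(BASE_URL)
    newsDict = getContent(html)
    saveData(newsDict)


if __name__ == '__main__':
    main()
```

After running the program, a file named `news.txt` is created in the current directory. It contains the titles and corresponding hyperlinks of the news items on the SISU homepage, provided the page's markup matches the `div.article` selector assumed in `getContent`; if the request fails or the selector matches nothing, the file is empty.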
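The task asks for only the first few news items, while the code above saves every `<li>` it finds. Below is a minimal sketch of one way to trim the dictionary before saving, assuming Python 3.7+ (where dictionaries preserve insertion order). The helper name `firstItems` and its parameter `n` are hypothetical and not part of the original answer, and the sample entries are placeholder data rather than real news.

```python
from itertools import islice


def firstItems(dic, n=5):
    """Return a new dict holding only the first n (title, link) pairs."""
    # Dicts keep insertion order in Python 3.7+, so page order is preserved.
    return dict(islice(dic.items(), n))


# Placeholder data standing in for the dictionary returned by getContent()
sample = {
    'News A': 'http://example.com/a',
    'News B': 'http://example.com/b',
    'News C': 'http://example.com/c',
}

trimmed = firstItems(sample, 2)
for title, link in trimmed.items():
    print(title + ' ' + link)
```

In the program above, this could be applied in `main()` as `saveData(firstItems(newsDict, 5))`, where 5 is an arbitrary choice for "the first few".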
