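Write a program that extracts the URLs (hyperlinks) from a web page. The input is an HTML file and the output is the URLs found in that page. Note: one approach (not the only one) is to scan the HTML file for the string "<a href=\"http://"; the text that follows it is the URL. Choose any HTML file, or simply use sina.html as input.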


Time: 2024-06-10 09:09:54 Views: 147

Using sina.html as input, here is a Python implementation:

```python
import re

# Read the HTML file
with open('sina.html', 'r', encoding='utf-8') as f:
    html = f.read()

# Capture the value of every double-quoted href in an <a> tag that starts
# with http:// or https://. The capture group returns only the URL, so no
# slicing is needed and other attributes (class, target, ...) are not included.
urls = re.findall(r'<a\s[^>]*?href="(https?://[^"]*)"', html, flags=re.IGNORECASE)

# Print the results
print('URLs found on this page:')
for url in urls:
    print(url)
```

For the sina.html used here, the output is:

```
URLs found on this page:
https://news.sina.com.cn/
http://sports.sina.com.cn/
http://finance.sina.com.cn/
http://ent.sina.com.cn/
http://auto.sina.com.cn/
http://tech.sina.com.cn/
http://edu.sina.com.cn/
http://games.sina.com.cn/
http://video.sina.com.cn/
http://blog.sina.com.cn/
http://book.sina.com.cn/
http://travel.sina.com.cn/
http://baby.sina.com.cn/
http://fashion.sina.com.cn/
http://health.sina.com.cn/
http://zhongce.sina.com/
http://www.sina.com.cn/
http://vip.book.sina.com.cn/
http://mail.sina.com.cn/
http://history.sina.com.cn/
http://app.sina.com.cn/applead.php?id=845
http://control.x.sina.com.cn/cgi-bin/t.cgi?u=https://sina.cn/pc/&m=2&h=1
http://www.sina.com.cn/midpage/mobile/index.d.html
http://www.sina.com.cn/midpage/wap/index.d.html
http://rss.sina.com.cn/news/allnews/sinanews.xml
http://rss.sina.com.cn/tech/all/roll.xml
http://www.sina.com.cn/contactus.html
http://www.sina.com.cn/law/law_map.shtml
http://www.sina.com.cn/ad.html
```

As the output shows, the program extracts the absolute http and https links from the page. Relative links, and href values written with single quotes or without quotes, are not matched by this pattern. If the file is not UTF-8 encoded, change the `encoding` argument accordingly.
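The question notes that scanning for a fixed string is not the only approach. As an alternative, here is a minimal sketch that uses the standard library's `html.parser` module instead of a regular expression, so single-quoted or unquoted `href` values and uppercase tags are also handled. The names `LinkCollector` and `extract_urls`, and the sample HTML string, are illustrative assumptions and not part of the original answer.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect absolute http/https href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this method.
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for attributes written without a value.
            if name == 'href' and value and value.startswith(('http://', 'https://')):
                self.urls.append(value)


def extract_urls(html):
    """Return the absolute links found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.urls


if __name__ == '__main__':
    # Made-up sample input; replace with the contents of sina.html to try a real page.
    sample = (
        '<a class="nav" href="http://news.sina.com.cn/">News</a>'
        "<a href='https://sports.sina.com.cn/' target=_blank>Sports</a>"
        '<a href="/relative/path">Skipped</a>'
    )
    for url in extract_urls(sample):
        print(url)
```

To run it on a file, read the file into a string as in the regex version and pass that string to `extract_urls`. A parser is usually more tolerant of attribute order and quoting than a hand-written pattern, at the cost of a little more code.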
