Using Python to get all links under the Entertainment & Leisure (娱乐休闲) menu of https://www.8kmm.com, creating files by the site's second-level categories and saving them, with UTF-8 encoding by default

Time: 2024-06-10 15:05:34 Views: 5

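Hello, I can answer this question. First, you can use Python's requests and BeautifulSoup libraries to fetch and parse the page, then use a regular expression to extract the category names from each link. A code example follows; the menu URL and the link pattern are assumptions about the site's structure and should be checked against the actual page:

```python
import os
import re

import requests
from bs4 import BeautifulSoup

# Send an HTTPS request to fetch the menu page
url = 'https://www.8kmm.com/zh-CN/EntertainmentMenu'
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the page with BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Collect every href on the page; the regex below keeps only menu links
links = [a_tag['href'] for a_tag in soup.find_all('a', href=True)]

# Expected link shape: .../EntertainmentMenu/<level 1>/<level 2>/<page>.html
pattern = re.compile(r'.*/EntertainmentMenu/(.*?)/(.*?)/(.*?)\.html')

# Create one file per second-level category and save the links
for link in links:
    m = pattern.match(link)
    if m:
        folder_path = m.group(1)  # first-level category -> directory
        file_path = os.path.join(folder_path, m.group(2) + '.txt')  # second-level category -> file
        try:
            os.makedirs(folder_path, exist_ok=True)
            # Append so that links in the same category accumulate in one file
            with open(file_path, 'a', encoding='utf-8') as f:
                f.write(link + '\n')
        except OSError as e:
            print(e)
```

This code saves the links in text files named after the second-level category, stored in directories named after the first-level category.

Note that the site's content may be dynamically generated, in which case links rendered by JavaScript will not appear in the HTML that requests receives, and the site may use anti-scraping measures. When crawling, set request headers and limit the request rate to avoid having your IP blocked. In addition, crawling must comply with applicable laws and the site's terms of use, so it is advisable to contact the site administrator for authorization before crawling.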
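The advice about request headers and rate limiting can be sketched roughly as follows. This is a minimal sketch that uses only the standard library so it runs without extra dependencies. The names `Throttle` and `fetch`, the `User-Agent` string, and the one-second interval are all illustrative assumptions, not values required by the site.

```python
import time
import urllib.request

# Hypothetical values: replace with a User-Agent that identifies your crawler
HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; example-crawler/0.1)'}
MIN_INTERVAL = 1.0  # assumed delay in seconds between requests, not a site rule


class Throttle:
    """Ensure at least `interval` seconds pass between consecutive calls to wait()."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch(url, throttle, headers=HEADERS, timeout=10):
    """Fetch `url` as bytes, sending custom headers and honoring the throttle."""
    throttle.wait()
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

A single `Throttle(MIN_INTERVAL)` instance would be shared across all calls to `fetch`. If you stay with requests, the same idea applies: call `throttle.wait()` before each request and pass the headers as `requests.get(url, headers=HEADERS, timeout=10)`.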