Python crawler: combining regular expressions with XPath to get sub-page URLs

Date: 2023-05-25 11:06:39 Views: 87

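Here is some example code:

```
import re
import requests
from lxml import etree

url = "https://www.example.com"
response = requests.get(url, timeout=10)
html = response.content.decode()

# Use a regular expression to match the sub-page URLs
regex = r'<a href="(https?://www\.example\.com/\d+/.+?)">'
sub_urls = re.findall(regex, html)

# Fetch each sub-page and parse it with XPath
for sub_url in sub_urls:
    sub_response = requests.get(sub_url, timeout=10)
    sub_html = sub_response.content.decode()
    sub_tree = etree.HTML(sub_html)
    # Extract the required information from the sub-page
    info = sub_tree.xpath('//div[@class="info"]/text()')
    print(info)
```

The script first fetches the main page with requests, then uses a regular expression to match all sub-page URLs. Explanation of the regular expression:

- `r'<a href="(https?://www\.example\.com/\d+/.+?)">'`: matches a string that starts with `<a href="` and ends with `">`. The group `(https?://www\.example\.com/\d+/.+?)` captures the URL itself: `http` or `https`, then the main site's address (the dots are escaped so they match only a literal `.`), then one or more digits, a slash, and one or more further characters. The `?` in `.+?` makes the match non-greedy, so it consumes as few characters as possible and does not run across several links.

The script then loops over the sub-page URLs, fetches each one with requests, and parses the HTML with the etree module of the lxml library to extract the required information. This is only an example; the regular expression and the XPath expression need to be adjusted to the HTML structure of the actual site.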
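To make the effect of non-greedy matching concrete, here is a minimal sketch that runs the pattern against an inline HTML string instead of a live site. The sample markup and the names `SAMPLE_HTML`, `LAZY`, and `GREEDY` are assumptions made for illustration, and the pattern is written with its literal dots escaped.

```python
import re

# Hypothetical sample markup: three links on a single line.
SAMPLE_HTML = (
    '<p><a href="https://www.example.com/12/first.html">First</a> '
    '<a href="http://www.example.com/345/second.html">Second</a> '
    '<a href="https://www.example.com/about">About</a></p>'
)

# Non-greedy version: ".+?" stops at the first closing '">'.
LAZY = re.compile(r'<a href="(https?://www\.example\.com/\d+/.+?)">')
# Greedy version, for comparison only: ".+" runs to the last '">' on the line.
GREEDY = re.compile(r'<a href="(https?://www\.example\.com/\d+/.+)">')

lazy_matches = LAZY.findall(SAMPLE_HTML)
greedy_matches = GREEDY.findall(SAMPLE_HTML)

for link in lazy_matches:
    print("lazy:  ", link)
for link in greedy_matches:
    print("greedy:", link)
```

With this sample, the non-greedy pattern yields the two numbered sub-page URLs and skips the `/about` link, which has no digit segment. The greedy pattern returns a single match that swallows the markup between the first and the last link.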