```python
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.57'
}
url = 'https://pic.netbian.com/'
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="slist"]/ul/li')
for li in li_list:
    pic = 'https://pic.netbian.com' + li.xpath('./a/img/@src')[0]
    name = li.xpath('./a/img@alt')[0] + '.jpg'
    print(pic, name)
```
Posted: 2023-12-21 10:04:36 · Views: 54
This is a Python program that uses the requests and lxml libraries to fetch data from https://pic.netbian.com/ and extract each image's URL and name from the HTML page with XPath. The problem is that the XPath expression './a/img@alt' is missing a '/': the correct expression is './a/img/@alt'. The invalid expression makes the name extraction fail (lxml rejects it with an XPathEvalError).
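The fix can be verified offline against a minimal HTML snippet (a hypothetical stand-in for the site's markup, not fetched live):

```python
from lxml import etree

# Hypothetical snippet mimicking the slist structure on the page
html = '''
<div class="slist"><ul>
  <li><a href="/x"><img src="/uploads/a.jpg" alt="mountain"></a></li>
</ul></div>
'''
tree = etree.HTML(html)
li_list = tree.xpath('//div[@class="slist"]/ul/li')
for li in li_list:
    pic = 'https://pic.netbian.com' + li.xpath('./a/img/@src')[0]
    # './a/img/@alt' selects the attribute; 'img@alt' is not valid XPath
    name = li.xpath('./a/img/@alt')[0] + '.jpg'
    print(pic, name)
```

With the corrected expression this prints `https://pic.netbian.com/uploads/a.jpg mountain.jpg`.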
Related questions
Fix the error reported by the following code:

```python
import requests
from lxml import etree

url = 'https://www.shanghairanking.cn/institution'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}
res = requests.get(url=url, headers=headers)
print(res.status_code)
html = etree.HTML(res.text)
lis = html.xpath("//*[@id="__layout"]/div/div[2]/div[2]/div[1]")
print(len(lis))
```
This code has one small error: the double quotes inside the xpath string are not escaped, which breaks the Python string literal. Switching the outer quotes to single quotes fixes it. The corrected code:
```python
import requests
from lxml import etree
url = 'https://www.shanghairanking.cn/institution'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}
res = requests.get(url=url, headers=headers)
print(res.status_code)
html = etree.HTML(res.text)
lis = html.xpath('//*[@id="__layout"]/div/div[2]/div[2]/div[1]')
print(len(lis))
```
With that change the code runs correctly and prints the number of matched elements on the page.
The following code reports an error:

```python
import requests
from lxml import etree

url = 'https://jn.lianjia.com/zufang/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
}
response = requests.get(url, headers=headers)
html = response.text
selector = etree.HTML(html)
total_page_list = selector.xpath('//div[@class="content__pg"]/div/@data-totalpage')
if len(total_page_list) > 0:
    total_page = total_page_list[0]
    print('总页数为:', total_page)
else:
    print('未匹配到结果')
```
This code does not actually raise an error, but it never loops over the pages to scrape each page's rental listings. You can change it to the following:
```python
import requests
from lxml import etree

url = 'https://jn.lianjia.com/zufang/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
}
response = requests.get(url, headers=headers)
html = response.text
selector = etree.HTML(html)
total_page_list = selector.xpath('//div[@class="content__pg"]/div/@data-totalpage')
if len(total_page_list) > 0:
    total_page = total_page_list[0]
    print('总页数为:', total_page)
    # Loop over every page of rental listings
    for num in range(1, int(total_page) + 1):
        url = 'https://jn.lianjia.com/zufang/pg{}/#contentList'.format(num)
        # Send the request and process the response
        # ...
else:
    print('未匹配到结果')
```
This code first sends a request to the Lianjia rental page for Jinan, converts the page into a queryable object with lxml's etree module, then uses XPath to locate the element that carries the total page count and checks the result. If the match succeeds, it prints the total page count and loops over every page of rental listings, using the .format() method to build each page's URL dynamically before sending the request and processing the response. If nothing matches, it prints a message to that effect.
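The URL construction inside the loop can be checked in isolation, assuming a hypothetical total_page of 3 in place of the value scraped from the data-totalpage attribute:

```python
# Hypothetical page count; the real value comes from the data-totalpage attribute
total_page = 3

# Build one URL per page with the same .format() pattern used in the loop
page_urls = [
    'https://jn.lianjia.com/zufang/pg{}/#contentList'.format(num)
    for num in range(1, total_page + 1)
]
print(page_urls[0])    # https://jn.lianjia.com/zufang/pg1/#contentList
print(len(page_urls))  # 3
```

Each of these URLs would then be fetched with the same headers as the first request; note that pg1 duplicates the landing page, so in practice you could reuse the response you already have for page 1.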