li.xpath('./a/img/@src').extract_first()

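The XPath expression `./a/img/@src` is evaluated relative to an `li` selector in Scrapy. It selects the `src` attribute of every `<img>` that is a direct child of an `<a>`, where that `<a>` is itself a direct child of the current `li`. `extract_first()` then returns the first matching value as a string, or `None` if nothing matches. For example, inside a spider callback in a Python Scrapy project:

```python
# Scrapy calls parse() with the downloaded response of each request,
# e.g. one created with scrapy.Request('your_url', callback=self.parse),
# where self.parse is the spider's parsing method.
def parse(self, response):
    # The relative XPath is applied to every li in the page;
    # extract_first() returns the first match across all of them.
    img_source = response.css('li').xpath('./a/img/@src').extract_first()
    if img_source:
        print(f"First image source in li elements: {img_source}")
    else:
        print("No image found or no match for the XPath expression.")
```

For the response of the given URL, this code collects the `src` attribute of each `<img>` that sits directly inside an `<a>` directly inside an `li`, and prints the first image source it finds.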

import scrapy
from scuw.items import ScuwItem


class DushuSpider(scrapy.Spider):
    name = "dushu"
    allowed_domains = ["www.dushu.com"]
    start_urls = ["https://www.dushu.com/lianzai/"]

    def parse(self, response):
        print('=================')
        li_list = response.xpath('//div[@class="bookslist"]//div[@class="book-info"]')
        for li in li_list:
            name = li.xpath('.//a/text()').extract_first()
            href = li.xpath('.//a/@href').extract_first()
            url = "https://www.dushu.com" + href
            yield scrapy.Request(url=url, callback=self.response_second, meta={'name': name})

    def response_second(self, response):
        src = response.xpath('//div[@class="bookdetails-left"]//div[@class="pic"]/img/@src').extract_first()
        name = response.meta['name']
        data = ScuwItem(src=src, name=name)
        yield data

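This is a simple Scrapy spider that crawls book information from dushu.com (https://www.dushu.com/). It imports an Item class named ScuwItem from the project's `items` module and uses it to hold the scraped fields. The code is explained step by step below:

1. Import Scrapy and the ScuwItem class:

```python
import scrapy
from scuw.items import ScuwItem
```

2. Define the Spider class DushuSpider:

```python
class DushuSpider(scrapy.Spider):
    name = "dushu"
    allowed_domains = ["www.dushu.com"]
    start_urls = ["https://www.dushu.com/lianzai/"]
```

This sets the spider's name (`name`), the domains it is allowed to crawl (`allowed_domains`), and the starting URLs (`start_urls`). Here the spider starts from https://www.dushu.com/lianzai/.

3. Define the parse method:

```python
    def parse(self, response):
        print('=================')
        li_list = response.xpath('//div[@class="bookslist"]//div[@class="book-info"]')
        for li in li_list:
            name = li.xpath('.//a/text()').extract_first()
            href = li.xpath('.//a/@href').extract_first()
            url = "https://www.dushu.com" + href
            yield scrapy.Request(url=url, callback=self.response_second, meta={'name': name})
```

The method first prints a separator line, then uses an XPath selector to find every `div.book-info` block inside `div.bookslist`. For each book it extracts the title and the link, and yields a new Request whose response is handled by `response_second`. The `meta` argument carries the title over to that callback. Note that `extract_first()` returns `None` when nothing matches, in which case the string concatenation that builds `url` raises a `TypeError`.

4. Define the response_second method:

```python
    def response_second(self, response):
        src = response.xpath('//div[@class="bookdetails-left"]//div[@class="pic"]/img/@src').extract_first()
        name = response.meta['name']
        data = ScuwItem(src=src, name=name)
        yield data
```

This method uses an XPath selector to find the cover image URL on the book's detail page and reads the title back from `meta`. It then creates a ScuwItem and yields it.

5. Enable the Item Pipeline in settings.py:

```python
ITEM_PIPELINES = {
    'scuw.pipelines.ScuwPipeline': 300,
}
```

This assumes the project has a pipeline class `ScuwPipeline` in `scuw/pipelines.py` that persists the items, for example to a database. The spider code above does not show it, and the setting is only needed if such a pipeline exists.

6. Run the spider:

```bash
scrapy crawl dushu
```

That covers the whole example.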

import scrapy
from ychouse.items import YchouseItem  # import the item class


class EsfSpider(scrapy.Spider):
    name = "esf"
    allowed_domains = ["allowdomians"]
    start_urls = ["https://fc.cqyc.net/resoldhome/esf/list"]

    def parse(self, response):
        lilist = response.xpath('/html/body/div[6]/div[3]/ul/li')
        # print(lilist)
        for li in lilist:
            item = YchouseItem()  # instantiate the item
            item["title"] = li.xpath('./div[2]/p[1]/a/text()').extract_first()
            item['href'] = 'https://fc.cqyc.net' + li.xpath('./div[2]/p[1]/a/@href').extract_first()
            item['housetype'] = li.xpath('./div[2]/p[2]/span[1]/text()').extract_first()
            item['floor'] = li.xpath('./div[2]/p[2]/span[2]/text()').extract_first()
            # print(item)
            yield scrapy.Request(item['href'], callback=self.parseinfo, meta={'item': item})
        # pagination
        # next_url = 'https://fc.cqyc.net' + response.xpath('//span[@class="next-page"]/../@href').extract_first()
        # if next_url != response.url:
        #     yield scrapy.Request(url=next_url, callback=self.parse)
        pass

    def parseinfo(self, response):
        item = response.meta['item']
        item['houseimg'] = response.xpath('/html/body/div[4]/div/div[3]/div[2]/ul/li[1]/img/@src').extract_first()
        print(item)

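This code uses Scrapy to crawl the second-hand housing list on fc.cqyc.net, collects the detail-page link of each listing, and then requests the detail page to scrape further information. XPath expressions parse the HTML, and the extracted values are stored in a custom Item object (`YchouseItem`). When requesting the detail page, the spider passes the Item through the Request's `meta` argument so that the callback `parseinfo` can keep adding fields to it. Two details are worth noting. First, `parseinfo` only prints the item and never yields it, so the item does not reach Scrapy's item pipelines or feed exports. Second, `allowed_domains = ["allowdomians"]` does not match `fc.cqyc.net`, so with default settings Scrapy's offsite filtering drops the detail-page requests unless the value is changed to `["fc.cqyc.net"]`.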
