网络爬虫技术解析及其工作流程

1 下载量 138 浏览量 更新于2024-06-29 收藏 847KB PPT 举报
"爬虫技术ppt课件.ppt" 爬虫技术是互联网信息挖掘的重要手段,它通过自动化的方式遍历互联网,收集和整理网页数据。网络爬虫,又称网页蜘蛛、网络机器人或网页追逐者,是一种遵循特定规则的程序或脚本,用于抓取并存储网页内容。它们在网络中的活动对搜索引擎的索引构建起着关键作用,帮助用户快速找到所需信息。 网络爬虫的基本结构通常包括三个主要部分:控制器、解析器和资源库。控制器管理整个爬虫系统的运行,调度多个爬虫线程,分配抓取任务。解析器则负责下载网页,并对网页内容进行预处理,去除JavaScript脚本、CSS样式、空格、HTML标签等非文本内容,以便于后续分析。资源库,一般采用大型数据库如Oracle,存储抓取到的网页资源,并建立索引来加速数据检索。 网络爬虫的工作流程可以概括为以下步骤: 1. 首先,选择一批精心挑选的初始种子URL,这些URL通常是爬虫开始爬取的入口点,具有代表性且内容丰富。 2. 将这些种子URL放入待抓取URL队列,这是一个等待处理的任务列表。 3. 爬虫从队列中取出一个URL,向对应网站发起HTTP请求,下载网页内容。 4. 解析器对下载的网页进行处理,提取出新的链接并加入待抓取URL队列,同时将网页内容存储到资源库中。 5. 步骤3和4反复进行,直到达到预设的停止条件,如抓取一定数量的页面、达到时间限制或遍历完所有发现的链接。 在实际应用中,爬虫可能还需要考虑反爬策略,如设置延时以避免过于频繁的请求导致服务器压力过大,或者模拟浏览器行为以绕过网站的反爬机制。此外,爬虫还需要处理各种异常情况,如网络连接问题、服务器错误等。 网络爬虫的应用广泛,例如搜索引擎的网页索引、市场分析的数据采集、社交媒体的研究等。然而,合法使用爬虫需要遵守网站的robots.txt协议以及相关的法律法规,尊重网站的版权和隐私政策。 爬虫技术是互联网信息获取和分析的重要工具,其设计和实现涉及到网络编程、数据解析、数据库管理等多个领域,对于理解和掌握网络信息的动态有着至关重要的作用。

SELECT PIS.SHOW_FLT_DETAIL AS SHOW_FLT_DETAIL -- new , PIS.SHOW_AWB_DETAIL AS SHOW_AWB_DETAIL -- new , PIS.DISPLAY_AIRLINE_CODE AS CARRIER_CODE , DECODE(PIS.REVERT_FLOW,'N',PIS.FLOW_TYPE,DECODE(PIS.FLOW_TYPE,'I','E','I')) AS FLOW_TYPE , PIS.SHIP_TO_LOCATION AS SHIP_TO_LOCATION , PIS.INVOICE_SEQUENCE AS INVOICE_SEQUENCE , PFT.FLIGHT_DATE AS FLIGHT_DATE , PFT.FLIGHT_CARRIER_CODE AS FLIGHT_CARRIER_CODE , PFT.FLIGHT_SERIAL_NUMBER AS FLIGHT_SERIAL_NUMBER , PFT.FLOW_TYPE AS AIRCRAFT_FLOW , FAST.AIRCRAFT_SERVICE_TYPE AS AIRCRAFT_SERVICE_TYPE , PPT.AWB_NUMBER AS AWB_NUMBER , PPT.WEIGHT AS WEIGHT , PPT.CARGO_HANDLING_OPERATOR AS CARGO_HANDLING_OPERATOR , PPT.SHIPMENT_PACKING_TYPE AS SHIPMENT_PACKING_TYPE , PPT.SHIPMENT_FLOW_TYPE AS SHIPMENT_FLOW_TYPE , PPT.SHIPMENT_BUILD_TYPE AS SHIPMENT_BUILD_TYPE , PPT.SHIPMENT_CARGO_TYPE AS SHIPMENT_CARGO_TYPE , PPT.REVENUE_TYPE AS REVENUE_TYPE , PFT.JV_FLIGHT_CARRIER_CODE AS JV_FLIGHT_CARRIER_CODE , PPT.PORT_TONNAGE_UID AS PORT_TONNAGE_UID , PPT.AWB_UID AS AWB_UID , PIS.INVOICE_SEPARATION_UID AS INVOICE_SEPARATION_UID , PFT.FLIGHT_TONNAGE_UID AS FLIGHT_TONNAGE_UID FROM PN_FLT_TONNAGES PFT , FZ_AIRLINES FA , PN_TONNAGE_FLT_PORTS PTFP , PN_PORT_TONNAGES PPT , FF_AIRCRAFT_SERVICE_TYPES FAST , SR_PN_INVOICE_SEPARATIONS PIS --new , SR_PN_INVOICE_SEP_DETAILS PISD--new , SR_PN_INV_SEP_PORT_TONNAGES PISPT --new WHERE PFT.FLIGHT_OPERATION_DATE >= trunc( CASE :rundate WHEN TO_DATE('01/01/1900', 'DD/MM/YYYY') THEN ADD_MONTHS(SYSDATE,-1) ELSE ADD_MONTHS(:rundate,-1) END, 'MON') AND PFT.FLIGHT_OPERATION_DATE < trunc( CASE :rundate WHEN TO_DATE('01/01/1900', 'DD/MM/YYYY') THEN TRUNC(SYSDATE) ELSE TRUNC(:rundate) END, 'MON') AND PFT.TYPE IN ('C', 'F') AND PFT.RECORD_TYPE = 'M' AND (PFT.TERMINAL_OPERATOR NOT IN ('X', 'A') OR (PFT.TERMINAL_OPERATOR <> 'X' AND FA.CARRIER_CODE IN (SELECT * FROM SPECIAL_HANDLING_AIRLINE) AND PPT.REVENUE_TYPE IN (SELECT * FROM SPECIAL_REVENUE_TYPE) AND PPT.SHIPMENT_FLOW_TYPE IN (SELECT * FROM SPECIAL_SHIPMENT_FLOW_TYPE) AND PFT.FLIGHT_OPERATION_DATE >= (select EFF_DATE from SPECIAL_HANDLING_EFF_DATE) )) AND PFT.DELETING_DATETIME IS NULL AND FA.AIRLINE_UID = PFT.AIRLINE_UID AND FA.DELETING_DATETIME IS NULL AND PTFP.FLIGHT_TONNAGE_UID = PFT.FLIGHT_TONNAGE_UID AND PTFP.RECORD_TYPE = 'M' AND PTFP.DELETING_DATETIME IS NULL AND PPT.TONNAGE_FLIGHT_PORT_UID (+)= PTFP.TONNAGE_FLIGHT_PORT_UID AND PPT.RECORD_TYPE (+)= 'M' AND PPT.DISCREPANCY_TYPE (+)= 'NONE' AND PPT.ADJUSTMENT_INC_FLAG (+)= 'Y' AND PPT.DELETING_DATETIME (+) IS NULL AND FAST.AIRCRAFT_SERVICE_TYPE_UID = PFT.AIRCRAFT_SERVICE_TYPE_UID AND FAST.DELETING_DATETIME IS NULL AND PIS.TEMPORAL_NAME = TO_CHAR((CASE :rundate --new WHEN TO_DATE('01/01/1900', 'DD/MM/YYYY') THEN TRUNC(SYSDATE) ELSE TRUNC(:rundate) END ), 'YYYYMM') || '00' AND PIS.INVOICE_SEPARATION_UID = PISD.INVOICE_SEPARATION_UID --new AND PISD.INVOICE_SEP_DETAIL_UID = PISPT.INVOICE_SEP_DETAIL_UID --new AND PISPT.PORT_TONNAGE_UID = PPT.PORT_TONNAGE_UID --new AND PIS.PRINT_SUPPORTING_DOC = 'Y';上面是oracle的写法,请转成spark SQL的写法。

2023-06-02 上传