Scrapy 1.4.0 Complete Tutorial: Key Concepts from Getting Started to Practice


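The Scrapy 1.4.0 documentation is a detailed guide to the Scrapy framework, published by the Scrapy developers on June 1, 2017. Scrapy is a web crawling framework designed for efficient, flexible extraction of data from web pages. The documentation ranges from introductory material to advanced concepts and aims to help users understand and master Scrapy's core functionality.

**Part 1: Getting started and getting help**
- **First steps**: introduces Scrapy's basic concepts and the ways to get help, including the official documentation, the mailing list, and community support, so that new users can get started quickly.

**Part 2: Basic concepts**
- **Command line tool**: explains how to use Scrapy's command line tool to manage and run spiders.
- **Spiders**: the core component of Scrapy; spiders define the crawling logic and rules and how requests and responses are handled.
- **Selectors**: explains in detail how XPath and CSS selectors are used in Scrapy to parse HTML and extract the required data.
- **Project structure and Items**: describes how a project is organized and how to define and use Items (data structures) to hold scraped data.
- **Item Loaders**: shows how to use Item Loaders to streamline data extraction and reduce repeated code.
- **Scrapy shell**: an interactive environment for testing and debugging scraping code.
- **Item Pipeline**: describes the data processing flow, including cleaning, validation, and storage.
- **Feed exports**: explains how to export scraped data in several formats, such as JSON and CSV.
- **Requests and Responses**: covers how HTTP requests and responses are represented and handled.
- **Link Extractors**: covers how to find and extract the links in a page so that a spider can discover new pages to follow.
- **Settings**: introduces Scrapy's settings, which let users customize crawler behavior and parameters.
- **Exceptions**: lists the built-in exceptions and how they are used.

**Part 3: Built-in services**
- **Logging**: explains Scrapy's logging system and how to record and inspect information during a crawl.
- **Stats Collection**: explains how to collect and analyze crawl statistics.
- **Sending e-mail**: explains how to send e-mail notifications from Scrapy.
- **Telnet Console**: introduces the console for inspecting and controlling a running Scrapy process.
- **Web Service**: covers the web service for monitoring and controlling a running crawler.

**Part 4: Solving specific problems**
- **Frequently Asked Questions**: answers common beginner questions, such as performance bottlenecks and troubleshooting.
- **Debugging Spiders**: provides debugging strategies and techniques for locating and fixing problems in spider code.
- **Spiders Contracts**: explains how to test spiders using contracts.
- **Common Practices**: recommends general conventions for using and developing with Scrapy.
- **Broad Crawls**: discusses how to tune Scrapy for crawling a large number of domains.
- **Browser tools**: shows how to use Firefox and Firebug for finer-grained page analysis and debugging.
- **Debugging memory leaks**: teaches how to diagnose and prevent memory leaks in Scrapy.
- **Downloading and processing files and images**: explains how to download and process files and images referenced by web pages.
- **Deploying Spiders**: gives guidance on deploying Scrapy spiders to a production environment.
- **AutoThrottle extension**: introduces an extension that adjusts the crawling speed automatically.
- **Benchmarking**: discusses how to measure and tune Scrapy's performance.

The Scrapy 1.4.0 documentation is a comprehensive guide. Both newcomers to Scrapy and developers who want to understand its internals can find useful information and practical experience in it. By reading the documentation and working through its examples, users can build efficient, stable web crawlers.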

Scrapy Documentation, Release 1.4.0

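import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The URL ends in "/page/N/", so element [-2] is the page number.
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)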

As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:

• name: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.

• start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

• parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.

The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.

How to run our spider

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl quotes

This command runs the spider named quotes that we’ve just added, which will send some requests to the quotes.toscrape.com domain. You will get an output similar to this:

... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...

Now, check the files in the current directory. You should notice that two new files have been created: quotes-1.html and quotes-2.html, with the content for the respective URLs, as our parse method instructs.
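How the file names come out of the URLs can be checked without running a crawl. The sketch below applies the same expression as parse() to plain strings; page_filename is a hypothetical helper name introduced only for this illustration.

```python
def page_filename(url):
    # The URL ends with "/", so split("/") leaves an empty string in the
    # last position and the page number in the second-to-last one.
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


if __name__ == '__main__':
    for url in ('http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/'):
        print(url, '->', page_filename(url))
```

Note that the expression depends on the trailing slash: for a URL such as http://quotes.toscrape.com/page/2 the second-to-last element is "page", not the page number.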


[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>

Using the shell, you can try selecting elements using CSS with the response object:

>>> response.css('title')

[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.

To extract the text from the title above, you can do:

>>> response.css('title::text').extract()

['Quotes to Scrape']

There are two things to note here: one is that we’ve added ::text to the CSS query, to mean we want to select only the text elements directly inside the <title> element. If we don’t specify ::text, we’d get the full title element, including its tags:

>>> response.css('title').extract()

['<title>Quotes to Scrape</title>']

The other thing is that the result of calling .extract() is a list, because we’re dealing with an instance of SelectorList. When you know you just want the first result, as in this case, you can do:

>>> response.css('title::text').extract_first()

'Quotes to Scrape'

As an alternative, you could’ve written:

>>> response.css('title::text')[0].extract()

'Quotes to Scrape'

However, using .extract_first() avoids an IndexError and returns None when it doesn’t find any element matching the selection.

There’s a lesson here: for most scraping code, you want it to be resilient to errors due to things not being found on a page, so that even if some parts fail to be scraped, you can at least get some data.

Besides the extract() and extract_first() methods, you can also use the re() method to extract using regular expressions:


>>> response.css('title::text').re(r'Quotes.*')

['Quotes to Scrape']

>>> response.css('title::text').re(r'Q\w+')

['Quotes']

>>> response.css('title::text').re(r'(\w+) to (\w+)')

['Quotes', 'Scrape']

To find the proper CSS selectors to use, you might find it useful to open the response page from the shell in your web browser using view(response). You can use your browser developer tools or extensions like Firebug (see sections about Using Firebug for scraping and Using Firefox for scraping).

Selector Gadget is also a nice tool to quickly find CSS selectors for visually selected elements, which works in many browsers.

XPath: a brief intro

Besides CSS, Scrapy selectors also support using XPath expressions:

>>> response.xpath('//title')

[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]

>>> response.xpath('//title/text()').extract_first()

'Quotes to Scrape'

XPath expressions are very powerful, and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under the hood. You can see that if you read closely the text representation of the selector objects in the shell.

While perhaps not as popular as CSS selectors, XPath expressions offer more power because besides navigating the structure, they can also look at the content. Using XPath, you’re able to select things like: select the link that contains the text “Next Page”. This makes XPath very fitting to the task of scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors; it will make scraping much easier.

We won’t cover much of XPath here, but you can read more about using XPath with Scrapy Selectors here. To learn more about XPath, we recommend this tutorial to learn XPath through examples, and this tutorial to learn “how to think in XPath”.

Extracting quotes and authors

Now that you know a bit about selection and extraction, let’s complete our spider by writing the code to extract the quotes from the web page.

Each quote in http://quotes.toscrape.com is represented by HTML elements that look like this:

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Let’s open up scrapy shell and play a bit to find out how to extract the data we want:

$ scrapy shell 'http://quotes.toscrape.com'

We get a list of selectors for the quote HTML elements with:

>>> response.css("div.quote")

Each of the selectors returned by the query above allows us to run further queries over their sub-elements. Let’s assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote:

>>> quote = response.css("div.quote")[0]

Now, let’s extract title, author and the tags from that quote using the quote object we just created:

>>> title = quote.css("span.text::text").extract_first()

>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

>>> author = quote.css("small.author::text").extract_first()

>>> author

'Albert Einstein'

Given that the tags are a list of strings, we can use the .extract() method to get all of them:

>>> tags = quote.css("div.tags a.tag::text").extract()

>>> tags

['change', 'deep-thoughts', 'thinking', 'world']

Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into a Python dictionary:

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
... a few more of these, omitted for brevity
>>>

Extracting data in our spider

Let’s get back to our spider. So far, it doesn’t extract any data in particular; it just saves the whole HTML page to a local file. Let’s integrate the extraction logic above into our spider.

A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield Python keyword in the callback.
