Scrapy 1.3.3 Crawler Framework: Getting Started Guide

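"scrapy1.3.3手册.pdf" Scrapy is a powerful crawling framework written in pure Python. It is built on Twisted, an asynchronous networking framework, and gives developers a convenient way to scrape web page content and images. Its design lets you build a crawler quickly by customizing only a few core components. The manual is organized into the following parts:

1. **First Steps**:
   - **Scrapy overview**: introduces Scrapy's basic concepts, including its design goals and core components.
   - **Installation guide**: explains how to install Scrapy on different operating systems.
   - **Scrapy tutorial**: a step-by-step tutorial that helps beginners understand the Scrapy workflow.
   - **Examples**: several working code examples that demonstrate different uses of Scrapy.

2. **Basic Concepts**:
   - **Command line tool**: how to use the Scrapy command line tool to create, run and manage projects.
   - **Spider**: Spiders are the core of Scrapy; they define how a site is crawled and how the crawled data is processed.
   - **Selectors**: Scrapy uses XPath or CSS selectors to parse HTML and XML documents and extract the required information.
   - **Items**: Items define the structure of the data to scrape; they behave much like Python dicts.
   - **Item Loaders**: tools for populating Items, which make it easier to extract and clean data coming from selectors.
   - **Scrapy Shell**: an interactive environment for testing and debugging selectors and data extraction.
   - **Item Pipeline**: processes the stream of scraped data, for example cleaning, validation and persistence.
   - **Feed Exports**: exports the scraped data in various formats (such as JSON and CSV).
   - **Requests and Responses**: Requests represent network requests and Responses are the server replies; together they are the basis of Scrapy's network interaction.
   - **Link Extractors**: extract links from pages so that the crawler can follow them.
   - **Settings**: configure the global behavior of a Scrapy project.
   - **Exceptions**: lists the exceptions Scrapy can raise and how to handle them.

3. **Built-in Services**:
   - **Logging**: Scrapy's logging system helps developers trace and debug a running crawler.
   - **Stats Collection**: collects statistics about a crawl, such as the number of requests and responses.
   - **Sending email**: sends notification emails under given conditions, such as when a crawl finishes or an error occurs.
   - **Telnet Console**: connects over telnet to Scrapy's internals for real-time monitoring.
   - **Web Service**: provides an HTTP interface for controlling and monitoring Scrapy crawlers remotely.

4. **Solving Specific Problems**:
   - **Frequently Asked Questions**: answers to questions users commonly run into when using Scrapy.
   - **Debugging Spiders**: tips and methods for debugging Scrapy spiders.
   - **Spider Contracts**: contracts that state the expected behavior of a spider and can be checked automatically.
   - **Common Practices**: advice on writing efficient and maintainable Scrapy spiders.
   - **Broad Crawls**: how to tune Scrapy for crawling a large number of sites in parallel.
   - **Using Firefox for scraping**: how to use the Firefox browser when scraping web pages.
   - **Using Firebug for scraping**: Firebug is a web debugging tool that helps identify page structure so that data can be scraped more easily.
   - **Debugging memory leaks**: methods for finding and fixing memory leaks in Scrapy crawlers.
   - **Downloading and processing files and images**: how to download and manage the files and images encountered during a crawl.
   - **Deploying Spiders**: how to deploy a Scrapy project to a production environment.
   - **AutoThrottle extension**: adjusts the request rate automatically to avoid putting excessive load on the target site.
   - **Benchmarking**: a way to evaluate and optimize Scrapy crawler performance.
   - **Jobs: pausing and resuming crawls**: supports pausing and resuming a crawl, for example when the network is unstable or resources are limited.

The manual covers each of these areas in detail, so both beginners and experienced developers can use it to improve their crawler development skills.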

Scrapy Documentation, Release 1.3.3

            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:

• name: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.

• start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

• parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.

The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.

How to run our spider

To put our spider to work, go to the project’s top level directory and run:

    scrapy crawl quotes

This command runs the spider named quotes that we’ve just added, which will send some requests to the quotes.toscrape.com domain. You will get an output similar to this:

    ... (omitted for brevity)
    2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
    2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
    2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
    2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
    2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
    2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
    2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
    ...

Now, check the files in the current directory. You should notice that two new files have been created: quotes-1.html and quotes-2.html, with the content for the respective URLs, as our parse method instructs.
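To see why the files get exactly these names, the two naming lines of parse() can be run on their own. The sketch below uses plain Python without Scrapy; the helper name page_filename is hypothetical and only wraps the same split-and-format steps.

```python
def page_filename(url):
    """Repeat the naming steps of parse(): pick the second-to-last
    "/"-separated piece of the URL and embed it in the file name."""
    # 'http://quotes.toscrape.com/page/2/'.split("/") ends with
    # [..., 'page', '2', '']; the trailing "/" produces the final ''.
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]
filenames = [page_filename(u) for u in urls]
print(filenames)
```

Note that the index -2 relies on the trailing slash: for a URL written without it, the same expression would pick up the piece 'page' instead of the page number.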


    [ ... Scrapy log here ... ]
    2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
    [s]   item       {}
    [s]   request    <GET http://quotes.toscrape.com/page/1/>
    [s]   response   <200 http://quotes.toscrape.com/page/1/>
    [s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
    [s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser
    >>>

Using the shell, you can try selecting elements using CSS with the response object:

    >>> response.css('title')
    [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.

To extract the text from the title above, you can do:

    >>> response.css('title::text').extract()
    ['Quotes to Scrape']

There are two things to note here: one is that we’ve added ::text to the CSS query, to mean we want to select only the text elements directly inside <title> element. If we don’t specify ::text, we’d get the full title element, including its tags:

    >>> response.css('title').extract()
    ['<title>Quotes to Scrape</title>']

The other thing is that the result of calling .extract() is a list, because we’re dealing with an instance of SelectorList. When you know you just want the first result, as in this case, you can do:

    >>> response.css('title::text').extract_first()
    'Quotes to Scrape'

As an alternative, you could’ve written:

    >>> response.css('title::text')[0].extract()
    'Quotes to Scrape'

However, using .extract_first() avoids an IndexError and returns None when it doesn’t find any element matching the selection.

There’s a lesson here: for most scraping code, you want it to be resilient to errors due to things not being found on a page, so that even if some parts fail to be scraped, you can at least get some data.
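The difference between the two styles can be illustrated without Scrapy. In the sketch below, plain lists stand in for the results of .extract(), and first_or_default is a hypothetical helper that mimics the "first result or None" behaviour described above; it is not Scrapy's implementation.

```python
def first_or_default(results, default=None):
    """Return the first item of a list, or `default` when the list is empty."""
    return results[0] if results else default


matches = ['Quotes to Scrape']  # stand-in for a selection that matched
no_matches = []                 # stand-in for a selection that matched nothing

title = first_or_default(matches)
subtitle = first_or_default(no_matches)

# Indexing the empty result directly fails instead of giving a placeholder.
try:
    no_matches[0]
    raised = False
except IndexError:
    raised = True

# With the resilient style, a partly filled record can still be built.
record = {'title': title, 'subtitle': subtitle}
print(record, raised)
```

The record keeps the title even though the second field was not found, which is the kind of partial result the paragraph above argues for.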

Besides the extract() and extract_first() methods, you can also use the re() method to extract using regular expressions:

    >>> response.css('title::text').re(r'Quotes.*')
    ['Quotes to Scrape']
    >>> response.css('title::text').re(r'Q\w+')
    ['Quotes']
    >>> response.css('title::text').re(r'(\w+) to (\w+)')
    ['Quotes', 'Scrape']
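The same three results can be approximated with Python's standard re module, which may help when testing a pattern outside the shell. This is a sketch, not the code behind re(): the helper flat_findall is hypothetical, and it flattens the tuples that re.findall returns for multi-group patterns so that the output matches the flat lists shown above. The first pattern is written here as Quotes.*, which reproduces the first output.

```python
import re


def flat_findall(pattern, text):
    """Collect all matches of `pattern` in `text` as one flat list of strings."""
    flat = []
    for match in re.findall(pattern, text):
        if isinstance(match, tuple):
            # Several capture groups: findall yields one tuple per match.
            flat.extend(match)
        else:
            # No group or a single group: findall yields plain strings.
            flat.append(match)
    return flat


title = 'Quotes to Scrape'
whole = flat_findall(r'Quotes.*', title)
word = flat_findall(r'Q\w+', title)
groups = flat_findall(r'(\w+) to (\w+)', title)
print(whole, word, groups)
```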

To find the proper CSS selectors to use, you might find it useful to open the response page from the shell in your web browser using view(response). You can use your browser developer tools or extensions like Firebug (see the sections Using Firebug for scraping and Using Firefox for scraping).

Selector Gadget is also a nice tool for quickly finding the CSS selector for visually selected elements, and it works in many browsers.

XPath: a brief intro

Besides CSS, Scrapy selectors also support using XPath expressions:

    >>> response.xpath('//title')
    [<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
    >>> response.xpath('//title/text()').extract_first()
    'Quotes to Scrape'

XPath expressions are very powerful, and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under the hood. You can see that if you read closely the text representation of the selector objects in the shell.

While perhaps not as popular as CSS selectors, XPath expressions offer more power because, besides navigating the structure, they can also look at the content. Using XPath, you’re able to select things like: select the link that contains the text “Next Page”. This makes XPath very fitting to the task of scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors, because it will make scraping much easier.

We won’t cover much of XPath here, but you can read more about using XPath with Scrapy Selectors here. To learn more about XPath, we recommend this tutorial to learn XPath through examples, and this tutorial to learn “how to think in XPath”.
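As a rough illustration of selecting by content rather than by structure, the sketch below uses the standard library's xml.etree.ElementTree, which implements only a small XPath subset. The pager markup is a hypothetical, well-formed fragment and is not taken from quotes.toscrape.com. A full XPath 1.0 engine also accepts functions such as contains(), which this subset does not support.

```python
import xml.etree.ElementTree as ET

# Hypothetical pager fragment, written only for this example.
PAGER = """
<ul class="pager">
  <li><a href="/page/1/">Previous Page</a></li>
  <li><a href="/page/3/">Next Page</a></li>
</ul>
"""

root = ET.fromstring(PAGER)

# Both links have the same tag and the same nesting, so structure alone
# cannot tell them apart; the predicate compares the text content instead.
# The [.='text'] predicate needs Python 3.7 or later.
next_links = root.findall(".//a[.='Next Page']")
next_href = next_links[0].get('href') if next_links else None
print(next_href)
```

Guarding the lookup with a conditional keeps the example from raising when no link matches, in the same spirit as preferring a default over an IndexError.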

Extracting quotes and authors

Now that you know a bit about selection and extraction, let’s complete our spider by writing the code to extract the quotes from the web page.

Each quote in http://quotes.toscrape.com is represented by HTML elements that look like this:

    <div class="quote">
        <span class="text">“The world as we have created it is a process of our
        thinking. It cannot be changed without changing our thinking.”</span>
        <span>
            by <small class="author">Albert Einstein</small>
            <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <a class="tag" href="/tag/change/page/1/">change</a>
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            <a class="tag" href="/tag/world/page/1/">world</a>
        </div>
    </div>

Let’s open up scrapy shell and play a bit to find out how to extract the data we want:

    $ scrapy shell 'http://quotes.toscrape.com'

We get a list of selectors for the quote HTML elements with:

    >>> response.css("div.quote")

Each of the selectors returned by the query above allows us to run further queries over their sub-elements. Let’s assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote:

    >>> quote = response.css("div.quote")[0]

Now, let’s extract title, author and the tags from that quote using the quote object we just created:

    >>> title = quote.css("span.text::text").extract_first()
    >>> title
    '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
    >>> author = quote.css("small.author::text").extract_first()
    >>> author
    'Albert Einstein'

Given that the tags are a list of strings, we can use the .extract() method to get all of them:

    >>> tags = quote.css("div.tags a.tag::text").extract()
    >>> tags
    ['change', 'deep-thoughts', 'thinking', 'world']

Having figured out how to extract each bit, we can now iterate over all the quotes elements and put them together into a Python dictionary:

    >>> for quote in response.css("div.quote"):
    ...     text = quote.css("span.text::text").extract_first()
    ...     author = quote.css("small.author::text").extract_first()
    ...     tags = quote.css("div.tags a.tag::text").extract()
    ...     print(dict(text=text, author=author, tags=tags))
    {'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
    {'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
    ... a few more of these, omitted for brevity
    >>>

Extracting data in our spider

Let’s get back to our spider. Until now, it doesn’t extract any data in particular, just saves the whole HTML page to a local file. Let’s integrate the extraction logic above into our spider.

A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield Python keyword in the callback, as you can see below:
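The excerpt ends before that listing. As a Scrapy-free sketch of the same pattern, the generator below yields one dictionary per quote. The names parse_rows and ROWS are hypothetical: the tuples stand in for values that the selectors would extract, and the texts are the two quotes shown in the shell output earlier.

```python
# Stand-in data: (text, author, tags) tuples in place of selector results.
ROWS = [
    ('“The world as we have created it is a process of our thinking. '
     'It cannot be changed without changing our thinking.”',
     'Albert Einstein',
     ['change', 'deep-thoughts', 'thinking', 'world']),
    ('“It is our choices, Harry, that show what we truly are, '
     'far more than our abilities.”',
     'J.K. Rowling',
     ['abilities', 'choices']),
]


def parse_rows(rows):
    """Yield one dict per quote instead of building and returning a list."""
    for text, author, tags in rows:
        yield {'text': text, 'author': author, 'tags': tags}


items = parse_rows(ROWS)  # calling a generator function runs no code yet
first = next(items)       # runs the body up to the first yield
rest = list(items)        # drains whatever is left
print(first['author'], len(rest))
```

Because each dictionary is handed over as soon as it is produced, the caller can start processing items before the whole page has been walked.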
