Scrapy 0.22 Official API Documentation: Essential for Crawler Development

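Updated 2024-07-22 · 927 KB PDF

"Scrapy 0.22 API, English edition"

Scrapy is an open-source crawling framework widely used for web scraping and data extraction. The official documentation for version 0.22 describes Scrapy’s features and components in detail and is the main reference for anyone collecting web data with it. Its chapters are clearly organized, so developers can look up what they need. The key topics are:

1. **Getting help**: where to ask questions, including the community forum, the mailing list, the IRC channel, and the documentation itself.
2. **First steps**: a quick overview of Scrapy, the installation guide, the Scrapy tutorial, and example projects that help beginners get started.
3. **Basic concepts**:
   - **Command line tool**: commands such as `scrapy startproject` and `scrapy crawl` for creating projects and running spiders.
   - **Items**: define the structure of the data to scrape; they behave much like Python dictionaries.
   - **Spiders**: the core of Scrapy; they define how pages are crawled and how responses are parsed.
   - **Selectors**: XPath- or CSS-based selectors for extracting data from HTML or XML documents.
   - **Item Loaders**: a more convenient way to populate Items, with input and output processors and default values.
   - **Scrapy shell**: an interactive shell for quickly testing selectors and crawling logic.
   - **Item Pipeline**: processes the Items extracted by spiders, for example cleaning, validating, and storing them.
   - **Feed exports**: export scraped data in formats such as CSV and JSON.
   - **Link Extractors**: extract links from HTML pages and help define the scope of a crawl.
4. **Built-in services**:
   - **Logging**: a logging system for debugging and monitoring a running crawl.
   - **Stats Collection**: a stats collector that records crawl metrics such as request and response counts.
   - **Sending e-mail**: Scrapy can send e-mail notifications that report crawl status or results.
   - **Telnet Console**: inspect and control a running Scrapy process over telnet.
   - **Web Service**: monitor and control a running crawler remotely through a JSON-RPC interface.
5. **Solving specific problems**:
   - **FAQ**: answers to frequently asked questions.
   - **Debugging Spiders**: debugging techniques and tools, such as the parse command, inspecting responses in the Scrapy shell, and logging.
   - **Spiders Contracts**: checks attached to spider callbacks that verify the spider behaves as expected.
   - **Common Practices**: recommended ways of using Scrapy and points to watch out for.
   - **Broad Crawls**: guidance on crawling a large number of sites.
   - **Using Firefox and Firebug for scraping**: using browser tools to work out what to extract.
   - **Debugging memory leaks**: finding and fixing memory problems in Scrapy.
   - **Downloading Item Images**: the images pipeline, which downloads images automatically.
   - **Ubuntu packages**: installing and managing Scrapy on Ubuntu.
   - **Scrapyd**: a service for deploying and running the spiders of multiple Scrapy projects.
   - **AutoThrottle**: automatically adjusts the request rate to the load on the server.
   - **Benchmarking**: measuring Scrapy’s performance.
   - **Jobs: pausing and resuming crawls**: pause a crawl and resume it later.
   - **DjangoItem**: integrates with Django models to simplify data storage.
6. **Extending Scrapy**: the architecture of Scrapy and how to write custom middleware, extensions, and download handlers for more complex or customized behavior.

The Scrapy 0.22 documentation is a comprehensive reference that covers everything from the basics to advanced topics, and it is useful to beginners and experienced crawler developers alike.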

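Scrapy Documentation, Release 0.22.0

    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]

def parse(self, response):
    # Name the file after the second-to-last URL path segment ("Books", "Resources").
    filename = response.url.split("/")[-2]
    # Write the raw response body; the with block closes the file afterwards.
    with open(filename, 'wb') as f:
        f.write(response.body)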

Crawling

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl dmoz

The crawl dmoz command runs the spider for the dmoz.org domain. You will get output similar to this:

2008-08-20 03:51:13-0300 [scrapy] INFO: Started project: dmoz
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled extensions: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled downloader middlewares: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled spider middlewares: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled item pipelines: ...
2008-08-20 03:51:14-0300 [dmoz] INFO: Spider opened
2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: <None>)
2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: <None>)
2008-08-20 03:51:14-0300 [dmoz] INFO: Spider closed (finished)

Pay attention to the lines containing [dmoz], which correspond to our spider. You can see a log line for each URL defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is shown at the end of the log line, where it says (referer: <None>).

More interestingly, as our parse method instructs, two files have been created: Books and Resources, containing the content of the two URLs.

What just happened under the hood?

Scrapy creates a scrapy.http.Request object for each URL in the start_urls attribute of the Spider, and assigns the spider’s parse method as its callback function.

These Requests are scheduled, then executed, and the resulting scrapy.http.Response objects are fed back to the spider through the parse() method.
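
The flow just described can be pictured with a toy model. The sketch below is not Scrapy code: Request, Response, ToySpider, fake_download and run are hypothetical stand-ins written in plain Python 3, with a simple queue in place of the real scheduler and no network access. It only illustrates the idea that each start URL becomes a request whose callback is the spider's parse method.

```python
from collections import deque


class Request:
    """Toy stand-in for scrapy.http.Request: a URL plus the callback to run."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class Response:
    """Toy stand-in for scrapy.http.Response."""

    def __init__(self, url, body):
        self.url = url
        self.body = body


class ToySpider:
    name = "toy"
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def __init__(self):
        self.seen = []

    def parse(self, response):
        # Same naming rule as the tutorial spider: second-to-last path segment.
        self.seen.append(response.url.split("/")[-2])


def fake_download(request):
    # Hypothetical downloader: returns a canned body instead of using the network.
    return Response(request.url, b"<html></html>")


def run(spider):
    # 1. One request per start URL, with spider.parse as its callback.
    queue = deque(Request(url, spider.parse) for url in spider.start_urls)
    # 2. Requests are taken off the queue and "downloaded".
    # 3. Each response is handed to the callback stored on its request.
    while queue:
        request = queue.popleft()
        request.callback(fake_download(request))


spider = ToySpider()
run(spider)
print(spider.seen)
```

In this toy run the callback records one name per start URL, mirroring the two files the tutorial spider writes.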

Extracting Items

Introduction to Selectors

There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath or CSS expressions called Scrapy Selectors. For more information about selectors and other extraction mechanisms see the Selectors documentation.

Here are some examples of XPath expressions and their meanings:

• /html/head/title: selects the <title> element, inside the <head> element of an HTML document

• /html/head/title/text(): selects the text inside the aforementioned <title> element.

• //td: selects all the <td> elements

• //div[@class="mine"]: selects all div elements which contain an attribute class="mine"

These are just a couple of simple examples of what you can do with XPath, but XPath expressions are indeed much more powerful. To learn more about XPath we recommend this XPath tutorial.
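
To try the four expressions above without Scrapy, here is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in. This is an assumption made for illustration: ElementTree supports only a subset of XPath, paths are written relative to the root element, and there is no text() step, so the text is read from the .text attribute instead. The sample document is made up.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample document (ElementTree needs valid XML).
DOC = """<html>
  <head><title>Example page</title></head>
  <body>
    <div class="mine">first</div>
    <div class="other">second</div>
    <table><tr><td>a</td><td>b</td></tr></table>
  </body>
</html>"""

root = ET.fromstring(DOC)  # root is the <html> element

# /html/head/title, written relative to the <html> root
title = root.find("head/title")

# /html/head/title/text(): no text() step here, so read .text instead
title_text = title.text

# //td: every <td> anywhere below the root
cells = root.findall(".//td")

# //div[@class="mine"]: every <div> whose class attribute equals "mine"
mine = root.findall(".//div[@class='mine']")

print(title.tag, title_text)
print([td.text for td in cells])
print([div.text for div in mine])
```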

For working with XPaths, Scrapy provides a Selector class, which is instantiated with an HtmlResponse or XmlResponse object as its first argument.

You can think of selectors as objects that represent nodes in the document structure. So, the first instantiated selectors are associated with the root node, or the entire document.

Selectors have four basic methods:

• xpath(): returns a list of selectors, each of them representing the nodes selected by the xpath expression given as argument.

• css(): returns a list of selectors, each of them representing the nodes selected by the CSS expression given as argument.

• extract(): returns a unicode string with the selected data.

• re(): returns a list of unicode strings extracted by applying the regular expression given as argument.

Trying Selectors in the Shell

To illustrate the use of Selectors we’re going to use the built-in Scrapy shell, which also requires IPython (an extended Python console) to be installed on your system.

To start a shell, you must go to the project’s top level directory and run:

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

Note: Remember to always enclose URLs in quotes when running the Scrapy shell from the command line, otherwise URLs containing arguments (i.e. the & character) will not work.

This is what the shell looks like:

[ ... Scrapy log here ... ]

[s] Available Scrapy objects:
[s] 2010-08-19 21:45:59-0300 [default] INFO: Spider closed (finished)
[s]   sel        <Selector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s]   item       Item()
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   spider     <Spider 'default' at 0x1b6c2d0>
[s] Useful shortcuts:
[s]   shelp()            Print this help
[s]   fetch(req_or_url)  Fetch a new request or URL and update shell objects
[s]   view(response)     View response in a browser

In [1]:

After the shell loads, you will have the response fetched in a local response variable, so if you type response.body you will see the body of the response, or you can type response.headers to see its headers.

The shell also pre-instantiates a selector for this response in the variable sel. The selector automatically chooses the best parsing rules (XML vs HTML) based on the response’s type.

So let’s try it:

In [1]: sel.xpath('//title')
Out[1]: [<Selector (title) xpath=//title>]

In [2]: sel.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: sel.xpath('//title/text()')
Out[3]: [<Selector (text) xpath=//title/text()>]

In [4]: sel.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: sel.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
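
The result of the last call can be reproduced outside the shell. As a rough illustration only (it uses Python's re module directly rather than Scrapy's own implementation of .re(), and is written for Python 3), applying the same pattern to the extracted title text returns the captured group of every match:

```python
import re

# The title text returned by Out[4] above.
title = "Open Directory - Computers: Programming: Languages: Python: Books"

# (\w+): captures each run of word characters that is directly followed by a colon.
words = re.findall(r"(\w+):", title)
print(words)  # ['Computers', 'Programming', 'Languages', 'Python']
```

"Books" is missing from the result because no colon follows it.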

Extracting the data

Now, let’s try to extract some real information from those pages.

You could type response.body in the console and inspect the source code to figure out the XPaths you need to use. However, inspecting the raw HTML there could become a very tedious task. To make it easier, you can use Firefox extensions like Firebug. For more information see Using Firebug for scraping and Using Firefox for scraping.

After inspecting the page source, you’ll find that the websites’ information is inside a <ul> element, in fact the second <ul> element.

So we can select each <li> element belonging to the sites list with this code:

sel.xpath('//ul/li')

And from them, the site descriptions:

sel.xpath('//ul/li/text()').extract()

The site titles:

sel.xpath('//ul/li/a/text()').extract()

And the site links:

sel.xpath('//ul/li/a/@href').extract()

As we said before, each .xpath() call returns a list of selectors, so we can chain further .xpath() calls to dig deeper into a node. We are going to use that property here:

sites = sel.xpath('//ul/li')
for site in sites:
    title = site.xpath('a/text()').extract()
    link = site.xpath('a/@href').extract()
    desc = site.xpath('text()').extract()
    print title, link, desc
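
The same "select each list item, then query relative to it" pattern can be tried without a running shell. The sketch below is an assumption-laden stand-in, not Scrapy: it is written for Python 3, uses the standard library's xml.etree.ElementTree, and runs on a made-up fragment shaped like the sites list rather than on the real page. ElementTree has no text() step, so the description is read from the text that follows the <a> element (its .tail).

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment shaped like the sites list; not the real page.
FRAGMENT = """<ul>
  <li><a href="http://gnosis.cx/TPiP/">Text Processing in Python</a> - By David Mertz</li>
  <li><a href="http://www.informit.com/store/product.aspx?isbn=0130211192">XML Processing with Python</a> - By Sean McGrath</li>
</ul>"""

root = ET.fromstring(FRAGMENT)  # root is the <ul> element

rows = []
for site in root.findall("li"):      # one element per <li>
    anchor = site.find("a")          # lookup relative to this <li>, like site.xpath('a')
    title = anchor.text              # counterpart of a/text()
    link = anchor.get("href")        # counterpart of a/@href
    # Text directly inside <li>: before <a> it is site.text, after <a> it is anchor.tail.
    desc = ((site.text or "") + (anchor.tail or "")).strip()
    rows.append((title, link, desc))

for row in rows:
    print(row)
```

Each query inside the loop is relative to the current list item, which is what keeps the title, link and description of one site together.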

Note: For a more detailed description of using nested selectors, see Nesting selectors and Working with relative XPaths in the Selectors documentation.

Let’s add this code to our spider:

    item = DmozItem()
    item['title'] = site.xpath('a/text()').extract()
    item['link'] = site.xpath('a/@href').extract()
    item['desc'] = site.xpath('text()').extract()
    items.append(item)
return items

Note: You can find a fully functional variant of this spider in the dirbot project, available at https://github.com/scrapy/dirbot

Now crawling the dmoz.org domain yields DmozItem objects:

[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.\n'],
 'link': [u'http://gnosis.cx/TPiP/'],
 'title': [u'Text Processing in Python']}
[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
 'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
 'title': [u'XML Processing with Python']}

2.3.4 Storing the scraped data

The simplest way to store the scraped data is to use Feed exports, with the following command:

scrapy crawl dmoz -o items.json -t json

That will generate an items.json file containing all scraped items, serialized in JSON.

In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. As with Items, a placeholder file for Item Pipelines has been set up for you when the project is created, in tutorial/pipelines.py. You don’t need to implement any item pipeline, though, if you just want to store the scraped items.
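
As a rough idea of what such a pipeline could look like, here is a minimal sketch. The class name StripDescriptionPipeline and its behaviour are hypothetical and not part of the tutorial; the only thing taken from Scrapy is the process_item(self, item, spider) method that a pipeline component implements. A plain dict stands in for a DmozItem so the snippet runs without Scrapy installed.

```python
class StripDescriptionPipeline(object):
    """Hypothetical pipeline: trims whitespace from every string in item['desc']."""

    def process_item(self, item, spider):
        # extract() returns a list of strings, so clean each entry.
        item["desc"] = [text.strip() for text in item.get("desc", [])]
        # Returning the item passes it on to the next pipeline stage.
        return item


# Quick check with a plain dict standing in for a DmozItem (no spider needed here).
sample = {"title": ["Text Processing in Python"], "desc": [" - By David Mertz\n"]}
cleaned = StripDescriptionPipeline().process_item(sample, spider=None)
print(cleaned["desc"])
```

In a real project the class would live in tutorial/pipelines.py and would only run once it is listed in the ITEM_PIPELINES setting.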

2.3.5 Next steps

This tutorial covers only the basics of Scrapy, but there are many other features not mentioned here. Check the What else? section in the Scrapy at a glance chapter for a quick overview of the most important ones.

Then, we recommend you continue by playing with an example project (see Examples), and then move on to the section Basic concepts.

2.4 Examples

The best way to learn is with examples, and Scrapy is no exception. For this reason, there is an example Scrapy project named dirbot, that you can use to play and learn more about Scrapy. It contains the dmoz spider described in the tutorial.

This dirbot project is available at: https://github.com/scrapy/dirbot

It contains a README file with a detailed description of the project contents.

If you’re familiar with git, you can check out the code. Otherwise you can download a tarball or zip file of the project by clicking on Downloads.
