Scrapy Framework Explained: A Development Guide

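"scrapy.pdf is a document about the Scrapy framework that covers topics from beginner to advanced level, including installation, basic concepts, built-in services, solving specific problems, and ways to extend Scrapy. The document describes Scrapy 0.23.0 and was released by the Scrapy developers on June 3, 2014."

Scrapy is a powerful Python crawling framework for efficiently scraping and processing web page data. The key topics of the document are summarized below:

1. **Getting help**: Scrapy offers several channels for help, such as the official documentation, community forums, and the mailing list.
2. **First steps**:
 - **Scrapy at a glance**: introduces the core components, such as Spiders, Selectors, Items, and the Item Pipeline.
 - **Installation guide**: steps for installing Scrapy on different operating systems.
 - **Scrapy Tutorial**: walks through building a first Scrapy project step by step.
 - **Examples**: a real example project to learn from.
3. **Basic concepts**:
 - **Command line tool**: the command-line interface for creating projects, starting spiders, and similar tasks.
 - **Items**: define the structure of the data to scrape, similar to a Python dictionary.
 - **Spiders**: classes that define how pages are crawled and how data is extracted.
 - **Selectors**: tools based on XPath or CSS expressions for extracting data from HTML or XML documents.
 - **Item Loaders**: helpers for populating Items that make it easier to clean and convert field values.
 - **Scrapy Shell**: an interactive environment for testing selectors and extraction code.
 - **Item Pipeline**: the processing pipeline that cleans, validates, and stores Items.
 - **Feed Exports**: export scraped data in formats such as JSON and CSV.
 - **Link Extractors**: extract links from HTML pages so they can be followed.
4. **Built-in services**:
 - **Logging**: logging facilities for debugging and monitoring a crawl.
 - **Stats Collection**: collects metrics about a crawl, such as request counts and download speed.
 - **Sending e-mail**: sends e-mail notifications when specific events occur.
 - **Telnet Console**: inspect and control a running crawler over a Telnet connection.
 - **Web Service**: monitor and control a running crawler through a web service.
5. **Solving specific problems**:
 - **Frequently Asked Questions**: answers to problems users commonly run into.
 - **Debugging Spiders**: techniques and tools for locating errors in spider code.
 - **Spider Contracts**: contracts for testing that spiders behave as expected.
 - **Common Practices**: ways to improve efficiency and stability.
 - **Broad Crawls**: how to tune Scrapy for crawls that span a large number of domains.
 - **Using Firefox and Firebug for scraping**: using browser tools to help with scraping and debugging.
 - **Debugging memory leaks**: how to detect and fix memory leaks.
 - **Downloading Item Images**: how to configure Scrapy to download the images found in pages.
 - **Ubuntu packages**: how to install Scrapy on Ubuntu.
 - **Scrapyd**: a server for deploying and managing Scrapy projects.
 - **AutoThrottle extension**: automatically adjusts the request rate to avoid being blocked by target sites.
 - **Benchmarking**: evaluate Scrapy's performance.
 - **Jobs: pausing and resuming crawls**: pause a running crawl and resume it later.
 - **DjangoItem**: use Django models together with Scrapy.
6. **Extending Scrapy**:
 - **Architecture overview**: describes Scrapy's modular design, which makes it easy to customize and extend.

With this document, users can get a complete picture of Scrapy, learn its core features, and go on to build efficient web crawling projects.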

Scrapy Documentation, Release 0.23.0

        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Name the output file after the last path segment of the URL.
        filename = response.url.split("/")[-2]
        # Save the raw response body to that file.
        with open(filename, 'wb') as f:
            f.write(response.body)

Crawling

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl dmoz

The crawl dmoz command runs the spider for the dmoz.org domain. You will get an output similar to this:

2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2014-01-23 18:13:07-0400 [dmoz] INFO: Spider opened
2014-01-23 18:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-01-23 18:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-01-23 18:13:09-0400 [dmoz] INFO: Closing spider (finished)

Pay attention to the lines containing [dmoz], which correspond to our spider. You can see a log line for each URL defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is shown at the end of the log line, where it says (referer: None).

More interestingly, as our parse method instructs, two files have been created: Books and Resources, each holding the content of the corresponding URL.
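The file names follow from the response.url.split("/")[-2] expression used in parse. The sketch below reproduces that step in plain Python, without Scrapy; the helper name filename_for is made up for illustration.

```python
# Illustrative helper (hypothetical name); mirrors the expression used in parse().
def filename_for(url):
    # The URLs end with "/", so the last element of the split is an empty
    # string and the second-to-last one is the final path segment.
    return url.split("/")[-2]


start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

filenames = [filename_for(url) for url in start_urls]
print(filenames)  # ['Books', 'Resources']
```

Note that a URL without a trailing slash would yield its parent segment instead, so this naming scheme only suits URLs shaped like the two above.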

What just happened under the hood?

Scrapy creates a scrapy.Request object for each URL in the start_urls attribute of the Spider, and assigns the spider’s parse method as its callback function.

These Requests are scheduled, then executed, and scrapy.http.Response objects are returned and then fed back to the spider, through the parse() method.
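The request and callback flow described here can be modelled in a few lines of plain Python. This is only a toy sketch and not how Scrapy is implemented: ToyRequest, ToyResponse, ToySpider, download and run are hypothetical names, the example.com URLs are placeholders, and no network access takes place.

```python
from collections import deque


class ToyRequest:
    """Hypothetical stand-in for a request: a URL plus a callback."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class ToyResponse:
    """Hypothetical stand-in for a response: a URL plus a body."""

    def __init__(self, url, body):
        self.url = url
        self.body = body


class ToySpider:
    # Placeholder URLs, used only to drive the toy loop below.
    start_urls = ["http://example.com/a/", "http://example.com/b/"]

    def __init__(self):
        self.seen = []

    def parse(self, response):
        # Default callback: record which responses were fed back.
        self.seen.append(response.url)


def download(request):
    # Fake downloader: returns a canned body instead of using the network.
    return ToyResponse(request.url, b"<html></html>")


def run(spider):
    # 1. One request per start URL, with parse() as its callback.
    queue = deque(ToyRequest(url, spider.parse) for url in spider.start_urls)
    # 2. Requests are scheduled (queued), then executed one by one.
    while queue:
        request = queue.popleft()
        response = download(request)
        # 3. Each response is fed back to the spider through the callback.
        request.callback(response)


spider = ToySpider()
run(spider)
print(spider.seen)
```

The real framework schedules and downloads requests asynchronously, so responses are not guaranteed to arrive in start_urls order, as the crawl log above shows.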

Extracting Items

Introduction to Selectors

There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath or CSS expressions called Scrapy Selectors. For more information about selectors and other extraction mechanisms see the Selectors documentation.

Here are some examples of XPath expressions and their meanings:

• /html/head/title: selects the <title> element, inside the <head> element of an HTML document

• /html/head/title/text(): selects the text inside the aforementioned <title> element.

• //td: selects all the <td> elements

• //div[@class="mine"]: selects all div elements which contain an attribute class="mine"
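To try expressions like these without a Scrapy project, the sketch below uses Python's standard xml.etree.ElementTree module as a stand-in. This is a substitution, not Scrapy's Selector: ElementTree supports only a subset of XPath, has no text() step, and does not accept absolute paths on an element, so the queries are written relative to the root element. The sample page is made up.

```python
import xml.etree.ElementTree as ET

# Made-up sample page; it must be well-formed XML for ElementTree to parse it.
PAGE = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <table><tr><td>one</td><td>two</td></tr></table>
    <div class="mine">first</div>
    <div class="other">second</div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# /html/head/title and its text(): written relative to the <html> root here.
title = root.find("head/title")
title_text = title.text

# //td: every <td> element anywhere below the root.
cells = [td.text for td in root.findall(".//td")]

# //div[@class="mine"]: only the divs whose class attribute equals "mine".
mine = [div.text for div in root.findall(".//div[@class='mine']")]

print(title_text, cells, mine)
```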

These are just a couple of simple examples of what you can do with XPath, but XPath expressions are indeed much more powerful. To learn more about XPath we recommend this XPath tutorial.

For working with XPaths, Scrapy provides a Selector class and convenient shortcuts to avoid instantiating selectors yourself every time you need to select something from a response.

You can see selectors as objects that represent nodes in the document structure. So, the first instantiated selectors are associated with the root node, or the entire document.

Selectors have four basic methods:

• xpath(): returns a list of selectors, each of them representing the nodes selected by the XPath expression given as argument.

• css(): returns a list of selectors, each of them representing the nodes selected by the CSS expression given as argument.

• extract(): returns a unicode string with the selected data.

• re(): returns a list of unicode strings extracted by applying the regular expression given as argument.

Trying Selectors in the Shell

To illustrate the use of Selectors we’re going to use the built-in Scrapy shell, which also requires IPython (an extended Python console) to be installed on your system.

To start a shell, you must go to the project’s top level directory and run:

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

Note: Remember to always enclose URLs in quotes when running the Scrapy shell from the command line; otherwise URLs containing arguments (i.e. the & character) will not work.

This is what the shell looks like:

[ ... Scrapy log here ... ]
2014-01-23 17:11:42-0400 [default] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x3636b50>
[s]   item       {}
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <Spider 'default' at 0x3cebf50>
[s] Useful shortcuts:
[s]   shelp()            Shell help (print this help)
[s]   fetch(req_or_url)  Fetch request (or URL) and update local objects
[s]   view(response)     View response in a browser

In [1]:

After the shell loads, you will have the response fetched in a local response variable, so if you type response.body you will see the body of the response, or you can type response.headers to see its headers.

More importantly, if you type response.selector you will access a selector object you can use to query the response, along with convenient shortcuts like response.xpath() and response.css(), which map to response.selector.xpath() and response.selector.css().

So let’s try it:

In [1]: response.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

In [2]: response.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: response.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]

In [4]: response.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: response.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
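The result of the last call can be checked without Scrapy. The sketch below applies the same regular expression to the title text from Out[4] using the standard re module; it assumes that the selector's re() method matches against that extracted text.

```python
import re

# Title text as shown in Out[4] of the shell session.
title_text = "Open Directory - Computers: Programming: Languages: Python: Books"

# Each match is a run of word characters immediately followed by a colon;
# the capture group keeps only the word, without the colon.
words = re.findall(r"(\w+):", title_text)
print(words)  # ['Computers', 'Programming', 'Languages', 'Python']
```

"Directory" and "Books" are not matched because neither is directly followed by a colon.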

Extracting the data

Now, let’s try to extract some real information from those pages.

You could type response.body in the console and inspect the source code to figure out the XPaths you need to use. However, inspecting the raw HTML code there could become a very tedious task. To make this easier, you can use Firefox extensions like Firebug. For more information see Using Firebug for scraping and Using Firefox for scraping.

After inspecting the page source, you’ll find that the websites’ information is inside a <ul> element, in fact the second <ul> element.

So we can select each <li> element belonging to the sites list with this code:

response.xpath('//ul/li')

And from them, the sites’ descriptions:

response.xpath('//ul/li/text()').extract()

The sites’ titles:

response.xpath('//ul/li/a/text()').extract()

And the sites’ links:

response.xpath('//ul/li/a/@href').extract()

As we’ve said before, each .xpath() call returns a list of selectors, so we can concatenate further .xpath() calls to dig deeper into a node. We are going to use that property here, so:

# Each sel is a selector for one <li>; the XPaths below are relative to it.
for sel in response.xpath('//ul/li'):
    title = sel.xpath('a/text()').extract()
    link = sel.xpath('a/@href').extract()
    desc = sel.xpath('text()').extract()
    print title, link, desc
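The same "select each list item, then query relative to it" pattern can be tried with the standard library alone. The sketch below uses xml.etree.ElementTree as a stand-in for Scrapy selectors, so the calls differ from the shell code; the HTML fragment and its example.com links are made up.

```python
import xml.etree.ElementTree as ET

# Made-up fragment shaped like a site list; the entries are illustrative.
FRAGMENT = """
<ul>
  <li><a href="http://example.com/one">Site one</a> - First description</li>
  <li><a href="http://example.com/two">Site two</a> - Second description</li>
</ul>
"""

rows = []
for li in ET.fromstring(FRAGMENT).findall("li"):
    anchor = li.find("a")       # relative query: only this <li>'s link
    title = anchor.text         # plays the role of a/text()
    link = anchor.get("href")   # plays the role of a/@href
    # In ElementTree the text following </a> is stored as the anchor's tail.
    desc = (anchor.tail or "").strip()
    rows.append((title, link, desc))

for row in rows:
    print(row)
```

Because every query starts from the current list item, the title, link and description of one site stay together instead of ending up in three unrelated lists.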

Note: For a more detailed description of using nested selectors, see Nesting selectors and Working with relative

Note: You can find a fully-functional variant of this spider in the dirbot project available at https://github.com/scrapy/dirbot

Now doing a crawl on the dmoz.org domain yields DmozItem objects:

[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.\n'],
     'link': [u'http://gnosis.cx/TPiP/'],
     'title': [u'Text Processing in Python']}
[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
     'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
     'title': [u'XML Processing with Python']}

2.3.4 Storing the scraped data

The simplest way to store the scraped data is by using the Feed exports, with the following command:

scrapy crawl dmoz -o items.json -t json

That will generate an items.json file containing all scraped items, serialized in JSON.
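Once the feed exists, it can be read back with Python's json module. The sketch below assumes the feed is a single JSON array with one object per item; to stay self-contained it parses an inline string instead of opening items.json, and the sample entry reuses values from the log output above with a shortened description.

```python
import json

# Stand-in for the contents of items.json (assumed layout: one JSON array,
# one object per scraped item). The description is shortened for brevity.
feed_text = """
[{"title": ["Text Processing in Python"],
  "link": ["http://gnosis.cx/TPiP/"],
  "desc": [" - By David Mertz; Addison Wesley."]}]
"""

items = json.loads(feed_text)

# Every field holds a list because extract() returns a list of strings.
titles = [item["title"][0] for item in items]
print(titles)
```

For a real feed, replacing json.loads(feed_text) with json.load on the opened items.json file would be the equivalent step.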

In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. As with Items, a placeholder file for Item Pipelines has been set up for you when the project is created, in tutorial/pipelines.py. You don’t need to implement any item pipelines, though, if you just want to store the scraped items.
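As a rough idea of what such a pipeline could look like, the sketch below defines a plain class with a process_item(item, spider) method, the hook Scrapy calls for each scraped item. The class name StripDescPipeline and its cleaning rule are made up for illustration, and the check at the end uses a plain dict in place of a DmozItem.

```python
class StripDescPipeline(object):
    """Hypothetical pipeline that trims whitespace from each 'desc' value."""

    def process_item(self, item, spider):
        # Called once per scraped item; the item is returned so that any
        # later pipeline stage and the feed export still receive it.
        item["desc"] = [text.strip() for text in item.get("desc", [])]
        return item


# Stand-alone check with a plain dict standing in for a scraped item.
sample = {
    "title": ["XML Processing with Python"],
    "desc": [" - By Sean McGrath\n"],
}
cleaned = StripDescPipeline().process_item(sample, spider=None)
print(cleaned["desc"])
```

In a project, a class like this would live in tutorial/pipelines.py and would also have to be enabled through the ITEM_PIPELINES setting before Scrapy runs it.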

2.3.5 Next steps

This tutorial covers only the basics of Scrapy, but there are many other features not mentioned here. Check the What else? section in the Scrapy at a glance chapter for a quick overview of the most important ones.

Then, we recommend you continue by playing with an example project (see Examples), and then continue with the section Basic concepts.

2.4 Examples

The best way to learn is with examples, and Scrapy is no exception. For this reason, there is an example Scrapy project named dirbot, that you can use to play and learn more about Scrapy. It contains the dmoz spider described in the tutorial.

This dirbot project is available at: https://github.com/scrapy/dirbot

It contains a README file with a detailed description of the project contents.

If you’re familiar with git, you can check out the code. Otherwise you can download a tarball or zip file of the project by clicking on Downloads.

The scrapy tag on Snipplr is used for sharing code snippets such as spiders, middlewares, extensions, or scripts. Feel free (and encouraged!) to share any code there.

Scrapy at a glance: Understand what Scrapy is and how it can help you.

Installation guide: Get Scrapy installed on your computer.
