Scrapy Framework in Detail and Secondary Development Guide

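"Scrapy documentation, mainly covering secondary development with the Scrapy framework, including chapters such as Getting help and First steps."

Scrapy is a powerful Python crawling framework for crawling websites and extracting structured data. It is widely used for data mining, information processing, and historical archival. Although it was originally designed for web scraping, Scrapy can also extract data through APIs (such as Amazon Associates Web Services) or serve as a general-purpose web crawler.

The Scrapy 0.17.0 documentation starts by explaining how to get help. If you run into a problem, try the FAQ, which answers some common questions. If you need specific information, search the genindex or modindex. You can also look through the archives of the scrapy-users mailing list or post a question to the list directly. Scrapy has a #scrapy IRC channel as well, where users can ask questions and talk with other developers. If you find a possible bug, report it in Scrapy's issue tracker.

The next chapter is "First steps", which explains Scrapy's basic concepts and workflow. Section 2.1, "Scrapy at a glance", explains how Scrapy, as an application framework, supports crawling websites and extracting structured data. Scrapy's core components include the Engine, Scheduler, Downloader, Downloader Middleware, Spiders, Items, Item Pipelines, and Link Extractors. These components work together so that Scrapy can crawl and process web content efficiently and flexibly.

In practice, developers define spider classes to specify which sites to crawl and which data to extract, write item classes to describe the structure of the extracted data, and set up item pipelines to clean, validate, and store the data. Downloader middleware allows custom download behavior, such as adding a user agent, handling cookies, or handling redirects. Scrapy provides a rich API and many configuration options, so developers can tailor crawler behavior to their needs.

The documentation also describes how to create and run a Scrapy project, how to write spiders, how to handle requests and responses, and how to debug and optimize Scrapy crawlers. It may also include a section on extensibility, such as plugin development and custom settings, to support more complex secondary development.

The Scrapy documentation is a thorough guide that covers the basics and explores advanced features, making it a useful resource for anyone who wants to scrape web data with Scrapy.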

Scrapy Documentation, Release 0.17.0

Crawling

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl dmoz

The crawl dmoz command runs the spider for the dmoz.org domain. You will get an output similar to this:

2008-08-20 03:51:13-0300 [scrapy] INFO: Started project: dmoz

2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled extensions: ...

2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled downloader middlewares: ...

2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled spider middlewares: ...

2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled item pipelines: ...

2008-08-20 03:51:14-0300 [dmoz] INFO: Spider opened

2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: <None>)

2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: <None>)

2008-08-20 03:51:14-0300 [dmoz] INFO: Spider closed (finished)

Pay attention to the lines containing [dmoz], which correspond to our spider. You can see a log line for each URL defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is shown at the end of the log line, where it says (referer: <None>).

More interestingly, as our parse method instructs, two files have been created: Books and Resources, with the content of both URLs.

What just happened under the hood?

Scrapy creates a scrapy.http.Request object for each URL in the start_urls attribute of the Spider, and assigns the parse method of the spider as its callback function.

These Requests are scheduled, then executed, and scrapy.http.Response objects are returned and then fed back to the spider, through the parse() method.
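The flow described above can be illustrated with a toy model. This is only a sketch, not Scrapy's implementation: the Request and Response classes, fake_download and crawl are hypothetical stand-ins written for this example, and the parse method is assumed to name its output after the last path segment of the URL, which would explain the Books and Resources files.

```python
from collections import deque


class Request:
    """Toy stand-in for scrapy.http.Request: a URL plus a callback."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class Response:
    """Toy stand-in for scrapy.http.Response."""

    def __init__(self, url, body):
        self.url = url
        self.body = body


class ToySpider:
    name = "dmoz"
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Assumed behaviour: derive the file name from the last path segment.
        return response.url.split("/")[-2]


def fake_download(request):
    # Hypothetical downloader: no network access, returns a canned body.
    return Response(request.url, b"<html>...</html>")


def crawl(spider):
    # 1. One Request per start URL, with spider.parse as its callback.
    queue = deque(Request(url, spider.parse) for url in spider.start_urls)
    results = []
    # 2. Requests are scheduled (queued here), then executed in turn.
    while queue:
        request = queue.popleft()
        response = fake_download(request)
        # 3. Each Response is fed back to the spider through the callback.
        results.append(request.callback(response))
    return results


print(crawl(ToySpider()))
```

The real engine runs downloads asynchronously and passes them through middlewares; the queue here only shows the order of the three steps.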

Extracting Items

Introduction to Selectors

There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath expressions called XPath selectors. For more information about selectors and other extraction mechanisms see the XPath selectors documentation.

Here are some examples of XPath expressions and their meanings:

• /html/head/title: selects the <title> element, inside the <head> element of an HTML document

• /html/head/title/text(): selects the text inside the aforementioned <title> element.

• //td: selects all the <td> elements

• //div[@class="mine"]: selects all div elements which contain an attribute class="mine"

These are just a couple of simple examples of what you can do with XPath, but XPath expressions are indeed much more powerful. To learn more about XPath we recommend this XPath tutorial.
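To try expressions like the ones listed above without Scrapy, here is a minimal sketch using Python's standard xml.etree.ElementTree module. Note the assumptions: ElementTree is not what Scrapy uses, it supports only a subset of XPath (paths are relative to the root element, and text() is not available, so the .text attribute is read instead), and the sample document is made up for this example.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample document for illustration only.
DOC = """<html>
  <head><title>Example page</title></head>
  <body>
    <div class="mine">first</div>
    <div class="other">second</div>
    <table><tr><td>a</td><td>b</td></tr></table>
  </body>
</html>"""

root = ET.fromstring(DOC)  # root is the <html> element

# /html/head/title, written relative to the <html> root
title = root.find("head/title")

# /html/head/title/text(): ElementTree has no text(), so read .text
title_text = title.text

# //td: all <td> elements anywhere below the root
cells = [td.text for td in root.findall(".//td")]

# //div[@class="mine"]: div elements whose class attribute equals "mine"
mine = [div.text for div in root.findall(".//div[@class='mine']")]

print(title_text)
print(cells)
print(mine)
```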

For working with XPaths, Scrapy provides an XPathSelector class, which comes in two flavours, HtmlXPathSelector (for HTML data) and XmlXPathSelector (for XML data). In order to use them you must instantiate the desired class with a Response object.

You can see selectors as objects that represent nodes in the document structure. So, the first instantiated selectors are associated to the root node, or the entire document.

Selectors have three methods:

• select(): returns a list of selectors, each of them representing the nodes selected by the XPath expression given as argument.

• extract(): returns a unicode string with the data selected by the XPath selector.

• re(): returns a list of unicode strings extracted by applying the regular expression given as argument.
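The relationship between the three methods can be sketched with a toy class. MiniSelector below is a hypothetical stand-in built on the standard library, not Scrapy's XPathSelector: it only mirrors the shape of select(), extract() and re() as described in the list above, and it relies on ElementTree's limited XPath subset.

```python
import re as regex
import xml.etree.ElementTree as ET


class MiniSelector:
    """Toy model of a selector: wraps one node of a parsed document."""

    def __init__(self, node):
        self._node = node

    def select(self, path):
        # Returns a list of selectors, one per node matched by the path.
        return [MiniSelector(node) for node in self._node.findall(path)]

    def extract(self):
        # Returns the wrapped node serialized as a text string.
        return ET.tostring(self._node, encoding="unicode").strip()

    def re(self, pattern):
        # Returns the strings matched by the pattern in the extracted data.
        return regex.findall(pattern, self.extract())


# Made-up sample document for illustration only.
DOC = (
    "<html><head><title>Python Books</title></head>"
    "<body><p>Price: 42 USD</p><p>Price: 7 USD</p></body></html>"
)

root = MiniSelector(ET.fromstring(DOC))
title = root.select(".//title")[0].extract()
prices = [match for p in root.select(".//p") for match in p.re(r"\d+")]

print(title)
print(prices)
```

Because select() returns selectors rather than strings, calls can be chained: a node is narrowed down first and converted to text only at the end.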

Trying Selectors in the Shell

To illustrate the use of Selectors we’re going to use the built-in Scrapy shell, which also requires IPython (an extended Python console) installed on your system.

To start a shell, you must go to the project’s top level directory and run:

scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

This is what the shell looks like:

[ ... Scrapy log here ... ]

[s] Available Scrapy objects:

[s] 2010-08-19 21:45:59-0300 [default] INFO: Spider closed (finished)

[s] hxs <HtmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>

[s] item Item()

[s] request <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>

[s] response <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>

[s] spider <BaseSpider 'default' at 0x1b6c2d0>

[s] xxs <XmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>

[s] Useful shortcuts:

[s] shelp() Print this help

[s] fetch(req_or_url) Fetch a new request or URL and update shell objects

[s] view(response) View response in a browser

In [1]:

After the shell loads, you will have the response fetched in a local response variable, so if you type response.body you will see the body of the response, or you can type response.headers to see its headers.

The shell also instantiates two selectors, one for HTML (in the hxs variable) and one for XML (in the xxs variable) with this response. So let’s try them:

In [1]: hxs.select('//title')

Out[1]: [<HtmlXPathSelector (title) xpath=//title>]

In [2]: hxs.select('//title').extract()

Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: hxs.select('//title/text()')

Out[3]: [<HtmlXPathSelector (text) xpath=//title/text()>]

In [4]: hxs.select('//title/text()').extract()

Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>

{'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.\n'],

'link': [u'http://gnosis.cx/TPiP/'],

'title': [u'Text Processing in Python']}

[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>

{'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],

'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],

'title': [u'XML Processing with Python']}

2.3.4 Storing the scraped data

The simplest way to store the scraped data is by using the Feed exports, with the following command:

scrapy crawl dmoz -o items.json -t json

That will generate an items.json file containing all scraped items, serialized in JSON.
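Once the file exists, it can be read back with Python's standard json module. The sketch below assumes that items.json holds a JSON array of objects whose fields are lists, as in the scraped output shown earlier. To stay self-contained it first writes a small sample file to a temporary directory; load_titles is a hypothetical helper name.

```python
import json
import os
import tempfile

# Sample data shaped like the scraped items shown earlier (an assumption
# about the layout of items.json: a JSON array of objects).
sample = [
    {"title": ["Text Processing in Python"],
     "link": ["http://gnosis.cx/TPiP/"]},
    {"title": ["XML Processing with Python"],
     "link": ["http://www.informit.com/store/product.aspx?isbn=0130211192"]},
]

# Write the sample to a temporary items.json so the example runs anywhere.
path = os.path.join(tempfile.mkdtemp(), "items.json")
with open(path, "w") as f:
    json.dump(sample, f)


def load_titles(json_path):
    """Return the first title of every item that has one."""
    with open(json_path) as f:
        items = json.load(f)
    return [item["title"][0] for item in items if item.get("title")]


titles = load_titles(path)
print(titles)
```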

In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. As with Items, a placeholder file for Item Pipelines has been set up for you when the project is created, in tutorial/pipelines.py. You don’t need to implement any item pipeline, though, if you just want to store the scraped items.
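As a rough idea of what could go into tutorial/pipelines.py, here is a minimal sketch of a pipeline. An item pipeline is a plain class with a process_item(item, spider) method that either returns the item or raises DropItem. CleanTitlePipeline and its cleaning rules are hypothetical, items are modelled as plain dicts, and DropItem is defined locally so the example runs without Scrapy installed (in a real project it would come from scrapy.exceptions).

```python
class DropItem(Exception):
    """Local stand-in for Scrapy's DropItem, so the sketch runs on its own."""


class CleanTitlePipeline:
    """Hypothetical pipeline: drop untitled items, tidy up descriptions."""

    def process_item(self, item, spider):
        # Discard items that carry no title at all.
        if not item.get("title"):
            raise DropItem("missing title")
        # Strip leading and trailing whitespace from each description.
        item["desc"] = [text.strip() for text in item.get("desc", [])]
        return item


pipeline = CleanTitlePipeline()
cleaned = pipeline.process_item(
    {"title": ["Text Processing in Python"], "desc": [" - By David Mertz\n"]},
    spider=None,
)
print(cleaned["desc"])
```

A pipeline only takes effect once it is enabled in the project settings, so a class like this does nothing until it is listed there.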

2.3.5 Next steps

This tutorial covers only the basics of Scrapy, but there are many other features not mentioned here. Check the What else? section in the Scrapy at a glance chapter for a quick overview of the most important ones.

Then, we recommend you continue by playing with an example project (see Examples), and then continue with the section Basic concepts.

2.4 Examples

The best way to learn is with examples, and Scrapy is no exception. For this reason, there is an example Scrapy project named dirbot, that you can use to play and learn more about Scrapy. It contains the dmoz spider described in the tutorial.

This dirbot project is available at: https://github.com/scrapy/dirbot

It contains a README file with a detailed description of the project contents.

If you’re familiar with git, you can check out the code. Otherwise you can download a tarball or zip file of the project by clicking on Downloads.

The scrapy tag on Snipplr is used for sharing code snippets such as spiders, middlewares, extensions, or scripts. Feel free (and encouraged!) to share any code there.

Scrapy at a glance: Understand what Scrapy is and how it can help you.

Installation guide: Get Scrapy installed on your computer.

Scrapy Tutorial: Write your first Scrapy project.

Examples: Learn more by playing with a pre-made Scrapy project.
