Scrapy 1.1 Beginner's Guide and Core Concepts Explained


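The Scrapy 1.1 reference manual is a detailed document covering the core concepts and techniques of the Scrapy crawling framework. It suits both beginners and experienced developers who want to study the framework in depth. This edition was released on July 8, 2016 and is organized into the following parts:

1. **Getting help**
- "Getting help" lists the ways to reach Scrapy support, documentation, and community resources, so that new users can resolve the problems they hit first.

2. **First steps**
- "Scrapy at a glance" introduces the basic architecture of the framework and how it works.
- "Installation guide" explains how to install Scrapy and set up a development environment.
- "Scrapy Tutorial" is a step-by-step tutorial that starts with creating a first spider.
- "Examples" contains practical examples of common scraping tasks.

3. **Basic concepts**
- "Command line tool" explains the command-line tool, for example the `scrapy crawl` command.
- "Spiders" is the core of crawler design: it defines how pages are crawled and parsed.
- "Selectors" introduces XPath and CSS selectors for extracting data from HTML.
- "Items" describes how to define the structure of the scraped data.
- "Item Loaders" explains how to populate items and how to clean and transform the data along the way.
- "Scrapy shell" is an interactive tool for testing and debugging selectors and spider logic.
- "Item Pipeline" describes the processing pipeline for scraped items, including cleaning and storage.
- "Feed exports" covers writing the scraped data out in formats such as JSON, CSV, or XML, and the storage backends that receive it.
- "Requests and Responses" covers sending HTTP requests and handling the responses.
- "Link Extractors" covers identifying and extracting links from pages.
- "Settings" presents the configuration options that adapt Scrapy to different scraping needs.
- "Exceptions" lists the available exceptions and what they mean.

4. **Built-in services**
- "Logging" describes log management and recording.
- "Stats Collection" collects statistics about a running crawl.
- "Sending email" covers email notifications, and "Telnet Console" covers inspecting and controlling a running crawler through a telnet session.
- "Web Service" covers monitoring and controlling a crawler through a web service.

5. **Solving specific problems**
- "Frequently Asked Questions" collects common questions and answers.
- "Debugging Spiders" explains how to locate and fix errors in spider code.
- "Spiders Contracts" explains how to test spiders using contracts.
- "Common Practices" presents commonly used practices.
- "Broad Crawls" discusses strategies and limits for large-scale crawls.
- "Using Firefox for scraping" and "Using Firebug for scraping" give tips on using browser tools.
- "Debugging memory leaks" shows how to find and avoid memory leaks.
- "Downloading and processing files and images" covers downloading and processing files and images.
- "Ubuntu packages" covers installing Scrapy from packages on Ubuntu.
- "Deploying Spiders" discusses deploying spiders to production.
- "AutoThrottle extension" introduces the automatic throttling extension.
- "Benchmarking" explains how to measure Scrapy's performance.
- "Jobs" covers pausing and resuming crawls.

The Scrapy 1.1 reference manual is useful study material for new and experienced Scrapy users alike. By working through its contents, readers can build efficient, maintainable web crawlers and solve the problems that come up in real projects.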

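Scrapy Documentation, Release 1.1.0

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Name the file after the last path segment of the URL, e.g. Books.html.
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)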

Crawling

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl dmoz

This command runs the spider named dmoz that we’ve just added, which will send some requests to the dmoz.org domain. You will get output similar to this:

2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Spider opened
2014-01-23 18:13:08-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-01-23 18:13:09-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-01-23 18:13:09-0400 [scrapy] INFO: Closing spider (finished)

Note: At the end you can see a log line for each URL defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is shown at the end of the log line, where it says (referer: None).

Now, check the files in the current directory. You should notice two new files have been created: Books.html and Resources.html, with the content for the respective URLs, as our parse method instructs.
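The file names follow directly from the expression used in parse. As a small stand-alone sketch (the helper name filename_for is introduced here only for illustration and is not part of Scrapy), the same expression can be tried on the two start URLs without running a crawl:

```python
def filename_for(url: str) -> str:
    """Mirror the tutorial's expression: take the second-to-last
    "/"-separated piece of the URL and append ".html"."""
    return url.split("/")[-2] + ".html"


start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

# Because each URL ends with "/", the last piece after splitting is an
# empty string, so index -2 is the final directory name.
names = [filename_for(url) for url in start_urls]
print(names)
```

Note that the expression relies on the trailing slash: for a URL ending in ".../Python/Books" (no slash), index -2 would pick "Python" instead of "Books".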

What just happened under the hood?

Scrapy creates scrapy.Request objects for each URL in the start_urls attribute of the Spider, and assigns them the parse method of the spider as their callback function.

These Requests are scheduled, then executed, and scrapy.http.Response objects are returned and then fed back to the spider, through the parse() method.
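The request/callback cycle described above can be pictured with a toy model. This is only a sketch under stated assumptions: ToyRequest, ToyResponse, ToySpider, fake_download and run are hypothetical names, nothing here is Scrapy's real implementation, and no network access takes place.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class ToyResponse:
    url: str
    body: bytes


@dataclass
class ToyRequest:
    url: str
    callback: Callable[[ToyResponse], None]


class ToySpider:
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def __init__(self) -> None:
        self.seen: List[str] = []

    def parse(self, response: ToyResponse) -> None:
        # The callback receives the response for the request it was attached to.
        self.seen.append(response.url)


def fake_download(request: ToyRequest) -> ToyResponse:
    # Stand-in for the downloader: returns a canned body instead of fetching.
    return ToyResponse(url=request.url, body=b"<html></html>")


def run(spider: ToySpider) -> None:
    # One request per start URL, each with spider.parse as its callback.
    queue: Deque[ToyRequest] = deque(
        ToyRequest(url, spider.parse) for url in spider.start_urls
    )
    while queue:
        request = queue.popleft()
        response = fake_download(request)
        request.callback(response)  # feed the response back to the spider


spider = ToySpider()
run(spider)
print(spider.seen)
```

The toy loop handles requests strictly in order. Scrapy itself downloads asynchronously, which is why the crawl log above shows the Resources page before the Books page.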

Extracting Items

Introduction to Selectors

There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath or CSS expressions called Scrapy Selectors. For more information about selectors and other extraction mechanisms see the Selectors documentation.

Here are some examples of XPath expressions and their meanings:


• /html/head/title: selects the <title> element, inside the <head> element of an HTML document

• /html/head/title/text(): selects the text inside the aforementioned <title> element.

• //td: selects all the <td> elements

• //div[@class="mine"]: selects all div elements which contain an attribute class="mine"
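To make the four expressions above concrete without a Scrapy installation, here is a rough sketch using Python's standard xml.etree.ElementTree module. This is a stand-in, not Scrapy's Selector: ElementTree supports only a subset of XPath (no absolute paths from an element and no text() step), so each expression is rewritten into its closest supported form, and the sample document is an invented, well-formed snippet.

```python
import xml.etree.ElementTree as ET

# Invented sample document; ElementTree needs well-formed markup.
DOC = """<html>
  <head><title>Example page</title></head>
  <body>
    <div class="mine">first</div>
    <div class="other">second</div>
    <table><tr><td>a</td><td>b</td></tr></table>
  </body>
</html>"""

root = ET.fromstring(DOC)  # root is the <html> element

# /html/head/title: relative to <html>, the path is head/title.
title = root.find("head/title")

# /html/head/title/text(): ElementTree exposes the text as an attribute.
title_text = title.text

# //td: every <td> element anywhere below the root.
cells = [td.text for td in root.findall(".//td")]

# //div[@class="mine"]: every <div> whose class attribute equals "mine".
mine = [div.text for div in root.findall(".//div[@class='mine']")]

print(title_text, cells, mine)
```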

These are just a couple of simple examples of what you can do with XPath, but XPath expressions are indeed much more powerful. To learn more about XPath, we recommend this tutorial to learn XPath through examples, and this tutorial to learn “how to think in XPath”.

Note: CSS vs XPath: you can go a long way extracting data from web pages using only CSS selectors. However, XPath offers more power because besides navigating the structure, it can also look at the content: you’re able to select things like: the link that contains the text ‘Next Page’. Because of this, we encourage you to learn about XPath even if you already know how to construct CSS selectors.

For working with CSS and XPath expressions, Scrapy provides the Selector class and convenient shortcuts to avoid instantiating selectors yourself every time you need to select something from a response.

You can see selectors as objects that represent nodes in the document structure. So, the first instantiated selectors are associated with the root node, or the entire document.

Selectors have four basic methods (click on the method to see the complete API documentation):

• xpath(): returns a list of selectors, each of which represents the nodes selected by the xpath expression given as argument.

• css(): returns a list of selectors, each of which represents the nodes selected by the CSS expression given as argument.

• extract(): returns a unicode string with the selected data.

• re(): returns a list of unicode strings extracted by applying the regular expression given as argument.

Trying Selectors in the Shell

To illustrate the use of Selectors we’re going to use the built-in Scrapy shell, which also requires IPython (an extended Python console) installed on your system.

To start a shell, you must go to the project’s top level directory and run:

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

Note: Remember to always enclose URLs in quotes when running the Scrapy shell from the command line, otherwise URLs containing arguments (i.e. the & character) will not work.

This is what the shell looks like:

[ ... Scrapy log here ... ]
2014-01-23 17:11:42-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x3636b50>
[s]   item       {}
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   settings   <scrapy.settings.Settings object at 0x3fadc50>
[s]   spider     <Spider 'default' at 0x3cebf50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]:

After the shell loads, you will have the response fetched in a local response variable, so if you type response.body you will see the body of the response, or you can type response.headers to see its headers.

More importantly response has a selector attribute which is an instance of Selector class, instantiated with this particular response. You can run queries on response by calling response.selector.xpath() or response.selector.css(). There are also some convenience shortcuts like response.xpath() or response.css() which map directly to response.selector.xpath() and response.selector.css().

So let’s try it:

In [1]: response.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

In [2]: response.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: response.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]

In [4]: response.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: response.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
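The result of In [5] can be checked without Scrapy. As a rough illustration using only Python's standard re module (a stand-in for the selector's re() method, not its implementation), applying the same pattern to the title text from Out[4] gives the same four words:

```python
import re

# Title text as shown in Out[4] above.
title_text = "Open Directory - Computers: Programming: Languages: Python: Books"

# (\w+): captures each run of word characters that is immediately
# followed by a colon; the colon itself is outside the capture group.
words = re.findall(r"(\w+):", title_text)
print(words)
```

"Open", "Directory" and "Books" are not matched because none of them is directly followed by a colon.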

Extracting the data

Now, let’s try to extract some real information from those pages.

You could type response.body in the console, and inspect the source code to figure out the XPaths you need to use. However, inspecting the raw HTML code there could become a very tedious task. To make it easier, you can use Firefox Developer Tools or some Firefox extensions like Firebug. For more information see Using Firebug for scraping and Using Firefox for scraping.

After inspecting the page source, you’ll find that the web site’s information is inside a <ul> element, in fact the second <ul> element.

So we can select each <li> element belonging to the site’s list with this code:

response.xpath('//ul/li')

And from them, the site’s descriptions:

response.xpath('//ul/li/text()').extract()

The site’s titles:


import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Each <li> inside a <ul> is treated as one directory entry.
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            # These XPaths are relative to the current <li> selector.
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

Note: You can find a fully-functional variant of this spider in the dirbot project available at https://github.com/scrapy/dirbot

Now crawling dmoz.org yields DmozItem objects:

[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n'],
     'link': [u'http://gnosis.cx/TPiP/'],
     'title': [u'Text Processing in Python']}
[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
     'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
     'title': [u'XML Processing with Python']}

2.3.4 Following links

Let’s say, instead of just scraping the stuff in Books and Resources pages, you want everything that is under the Python directory.

Now that you know how to extract data from a page, why not extract the links for the pages you are interested in, follow them and then extract the data you want for all of them?

Here is a modification to our spider that does just that:

import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)
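The spider calls response.urljoin because the href values extracted from a page may be relative, while a request needs an absolute URL. As a hedged stand-alone sketch, the standard library's urllib.parse.urljoin shows the kind of resolution involved. The href values below are invented examples, and the page URL stands in for response.url:

```python
from urllib.parse import urljoin

# Stand-in for response.url: the page the links were extracted from.
page_url = "http://www.dmoz.org/Computers/Programming/Languages/Python/"

# Invented href values of the kinds a page might contain.
hrefs = [
    "/Computers/Programming/Languages/Python/Books/",  # root-relative
    "Resources/",                                      # relative to the page
    "http://www.dmoz.org/Computers/",                  # already absolute
]

# Resolve each href against the page URL to get an absolute URL.
absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)
```

An href that is already absolute is returned unchanged, so the same call is safe for every link found on the page.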
