Scrapy 1.0.5 Tutorial and Core Concepts

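Official Scrapy documentation, release 1.0.5. Scrapy is a Python framework for crawling websites and for extracting and processing the data they contain. This document is the detailed guide to Scrapy 1.0.5, released by the Scrapy developers on April 7, 2016. It takes the reader from first steps through to solving specific problems.

1. **Getting started**
   - **Scrapy at a glance**: the basic structure and workflow of Scrapy, including core components such as Spiders, Selectors, Items and Item Loaders.
   - **Installation guide**: how to install Scrapy on different operating systems and check that the environment is set up correctly.
   - **Scrapy tutorial**: a step-by-step tutorial for writing a first Scrapy project.
   - **Examples**: sample projects that show Scrapy applied in practice.
2. **Basic concepts**
   - **Command line tool**: how to create, run and manage projects from the command line.
   - **Spiders**: the core component that defines crawling rules and parsing logic.
   - **Selectors**: extract data from HTML or XML documents using XPath or CSS expressions.
   - **Items**: define the structure of the scraped data, with a dictionary-like interface.
   - **Item Loaders**: populate Items while simplifying data cleaning and conversion.
   - **Scrapy shell**: an interactive environment for quickly testing and debugging selectors.
   - **Item Pipeline**: post-processing of Items, such as cleaning, validation and storage.
   - **Feed exports**: export scraped data in formats such as JSON and CSV.
   - **Requests and Responses**: the objects that represent HTTP requests and responses.
   - **Link Extractors**: extract the links to follow from web pages.
   - **Settings**: configure the global parameters of a Scrapy project.
   - **Exceptions**: the exceptions Scrapy defines and how to handle them.
3. **Built-in services**
   - **Logging**: record information while Scrapy runs, for debugging and monitoring.
   - **Stats collection**: collect runtime statistics such as request counts.
   - **Sending e-mail**: send notification e-mails when specific events occur.
   - **Telnet console**: inspect a running crawler through a telnet client.
   - **Web service**: an HTTP interface for monitoring and controlling a crawler remotely.
4. **Solving specific problems**
   - **Frequently asked questions**: common problems and their solutions.
   - **Debugging spiders**: methods and tips for debugging Scrapy spiders.
   - **Spider contracts**: a way to test spiders by attaching contracts to their callbacks.
   - **Common practices**: recommended ways to write code and organize projects.
   - **Broad crawls**: tuning Scrapy for crawling a large number of domains.
   - **Using Firefox for scraping**: analyzing pages with the Firefox browser.
   - **Using Firebug for scraping**: analyzing page structure and data with the Firebug extension.
   - **Debugging memory leaks**: how to detect and fix memory problems in a crawler.
   - **Downloading and processing files and images**: how to download and handle non-text content.
   - **Ubuntu packages**: installing and managing Scrapy on Ubuntu.
   - **Deploying spiders**: steps for deploying a Scrapy project to production.
   - **AutoThrottle extension**: adjust the crawl rate automatically to avoid overloading the target site.
   - **Benchmarking**: how to measure Scrapy's performance.

The documentation covers the key topics from beginner to advanced level, which makes it a valuable reference for anyone who wants to learn and use Scrapy in depth.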

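Scrapy Documentation, Release 1.0.5

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Save each page body to a file named after the last URL path segment
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)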

Crawling

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl dmoz

This command runs the spider named dmoz that we’ve just added, which sends some requests to the dmoz.org domain. You will get an output similar to this:

2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Spider opened
2014-01-23 18:13:08-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-01-23 18:13:09-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-01-23 18:13:09-0400 [scrapy] INFO: Closing spider (finished)

Note: At the end you can see a log line for each URL defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is shown at the end of the log line, where it says (referer: None).

Now, check the files in the current directory. You should notice two new files have been created: Books.html and Resources.html, with the content for the respective URLs, as our parse method instructs.
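The two file names follow directly from the URLs. Each URL ends with a slash, so splitting on "/" leaves an empty final element and index -2 is the last real path segment. A minimal sketch in plain Python, where the helper name filename_for is hypothetical and not part of Scrapy:

```python
def filename_for(url):
    """Mirror the expression used in parse(): last path segment + '.html'."""
    # The trailing slash makes the final split element an empty string,
    # so [-2] picks the last non-empty path segment.
    return url.split("/")[-2] + ".html"


start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

filenames = [filename_for(url) for url in start_urls]
print(filenames)
```

Note that a URL without the trailing slash would yield the parent segment instead, so this naming scheme only suits URLs shaped like the ones above.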

What just happened under the hood?

Scrapy creates scrapy.Request objects for each URL in the start_urls attribute of the Spider, and assigns them the parse method of the spider as their callback function.

These Requests are scheduled, then executed, and scrapy.http.Response objects are returned and then fed back to the spider, through the parse() method.
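The flow just described can be pictured with a toy model. The sketch below is not Scrapy's implementation: the Request, Response, fake_download and crawl names are stand-ins written for illustration, nothing touches the network, and the real engine is asynchronous and does not promise this processing order.

```python
from collections import deque


class Request:
    """Toy stand-in for scrapy.Request: a URL plus the callback to invoke."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class Response:
    """Toy stand-in for scrapy.http.Response."""

    def __init__(self, url, body):
        self.url = url
        self.body = body


def fake_download(request):
    # Hypothetical downloader: returns a canned body instead of using the network.
    return Response(request.url, b"<html>stub</html>")


class ToySpider:
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def __init__(self):
        self.parsed = []

    def parse(self, response):
        # Record which responses reached the callback.
        self.parsed.append(response.url)


def crawl(spider):
    # 1. One Request per start URL, with spider.parse as its callback.
    queue = deque(Request(url, spider.parse) for url in spider.start_urls)
    # 2. Requests are taken from the queue and "downloaded" one by one.
    while queue:
        request = queue.popleft()
        response = fake_download(request)
        # 3. The Response is fed back to the spider through the callback.
        request.callback(response)


spider = ToySpider()
crawl(spider)
print(spider.parsed)
```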

Extracting Items

Introduction to Selectors

There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath or CSS expressions called Scrapy Selectors. For more information about selectors and other extraction mechanisms see the Selectors documentation.

Here are some examples of XPath expressions and their meanings:

• /html/head/title: selects the <title> element, inside the <head> element of an HTML document

• /html/head/title/text(): selects the text inside the aforementioned <title> element.

• //td: selects all the <td> elements

• //div[@class="mine"]: selects all div elements which contain an attribute class="mine"
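To get a feel for these expressions without a Scrapy project, here is a minimal sketch using Python's standard-library xml.etree.ElementTree. It is an approximation under stated assumptions: ElementTree implements only a subset of XPath, so paths are written relative to the root element, text is read through the .text attribute rather than text(), and the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample document (ElementTree needs valid XML).
DOC = """<html>
  <head><title>Example page</title></head>
  <body>
    <div class="mine">first</div>
    <div class="other">second</div>
    <div class="mine">third</div>
    <table><tr><td>a</td><td>b</td></tr></table>
  </body>
</html>"""

root = ET.fromstring(DOC)  # root is the <html> element

title = root.find("head/title")               # counterpart of /html/head/title
title_text = title.text                       # counterpart of /html/head/title/text()
cells = root.findall(".//td")                 # counterpart of //td
mine = root.findall(".//div[@class='mine']")  # counterpart of //div[@class="mine"]

print(title_text, len(cells), [d.text for d in mine])
```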

These are just a couple of simple examples of what you can do with XPath, but XPath expressions are indeed much more powerful. To learn more about XPath, we recommend this tutorial to learn XPath through examples, and this tutorial to learn “how to think in XPath”.

Note: CSS vs XPath: you can go a long way extracting data from web pages using only CSS selectors. However, XPath offers more power because besides navigating the structure, it can also look at the content: you’re able to select things like: the link that contains the text ‘Next Page’. Because of this, we encourage you to learn about XPath even if you already know how to construct CSS selectors.

For working with CSS and XPath expressions, Scrapy provides a Selector class and convenient shortcuts to avoid instantiating selectors yourself every time you need to select something from a response.

You can see selectors as objects that represent nodes in the document structure. So, the first instantiated selectors are associated with the root node, or the entire document.

Selectors have four basic methods:

• xpath(): returns a list of selectors, each of which represents the nodes selected by the XPath expression given as argument.

• css(): returns a list of selectors, each of which represents the nodes selected by the CSS expression given as argument.

• extract(): returns a unicode string with the selected data.

• re(): returns a list of unicode strings extracted by applying the regular expression given as argument.

Trying Selectors in the Shell

To illustrate the use of Selectors we’re going to use the built-in Scrapy shell, which also requires IPython (an extended Python console) installed on your system.

To start a shell, you must go to the project’s top level directory and run:

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

Note: Remember to always enclose URLs in quotes when running the Scrapy shell from the command line, otherwise URLs containing arguments (i.e. the & character) will not work.

This is what the shell looks like:

[ ... Scrapy log here ... ]
2014-01-23 17:11:42-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x3636b50>
[s]   item       {}
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   settings   <scrapy.settings.Settings object at 0x3fadc50>
[s]   spider     <Spider 'default' at 0x3cebf50>
[s] Useful shortcuts:
[s]   shelp()            Shell help (print this help)
[s]   fetch(req_or_url)  Fetch request (or URL) and update local objects
[s]   view(response)     View response in a browser

In [1]:

After the shell loads, you will have the response fetched in a local response variable, so if you type response.body you will see the body of the response, or you can type response.headers to see its headers.

More importantly, response has a selector attribute which is an instance of the Selector class, instantiated with this particular response. You can run queries on response by calling response.selector.xpath() or response.selector.css(). There are also some convenience shortcuts like response.xpath() or response.css() which map directly to response.selector.xpath() and response.selector.css().

So let’s try it:

In [1]: response.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

In [2]: response.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: response.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]

In [4]: response.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: response.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
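The result of the last command can be reproduced with the standard re module. This sketch assumes that re() applies the pattern to the extracted text and returns the captured groups, much as re.findall does; the title string is copied from the session output.

```python
import re

# Text of the <title> element, as shown by the extract() call in the session.
title = "Open Directory - Computers: Programming: Languages: Python: Books"

# (\w+): captures every word that is immediately followed by a colon.
words = re.findall(r"(\w+):", title)
print(words)
```

"Directory" and "Books" are left out because neither is directly followed by a colon.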

Extracting the data

Now, let’s try to extract some real information from those pages.

You could type response.body in the console, and inspect the source code to figure out the XPaths you need to use. However, inspecting the raw HTML code there could become a very tedious task. To make it easier, you can use Firefox Developer Tools or some Firefox extensions like Firebug. For more information see Using Firebug for scraping and Using Firefox for scraping.

After inspecting the page source, you’ll find that the web site’s information is inside a <ul> element, in fact the second <ul> element.

So we can select each <li> element belonging to the site’s list with this code:

response.xpath('//ul/li')

And from them, the site’s descriptions:

response.xpath('//ul/li/text()').extract()

The site’s titles:

import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Build one item per <li> element; each field holds a list of matches
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

Note: You can find a fully-functional variant of this spider in the dirbot project available at https://github.com/scrapy/dirbot

Now crawling dmoz.org yields DmozItem objects:

[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.\n'],
     'link': [u'http://gnosis.cx/TPiP/'],
     'title': [u'Text Processing in Python']}
[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
     'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
     'title': [u'XML Processing with Python']}
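Every field in the output above is a list, because the values come from extract() calls on a list of selectors, and the descriptions carry leading and trailing whitespace. One possible way to tidy an item afterwards is sketched below in plain Python. The helper first_clean is hypothetical and not part of Scrapy, and the item is a plain dict copied from the log output.

```python
def first_clean(values):
    """Return the first value of a list, stripped, or None if the list is empty."""
    return values[0].strip() if values else None


# Plain-dict copy of the second scraped item shown in the log output.
item = {
    "desc": [
        " - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, "
        "has CD-ROM. Methods to build XML applications fast, Python tutorial, "
        "DOM and SAX, new Pyxie open source XML processing library. "
        "[Prentice Hall PTR]\n"
    ],
    "link": ["http://www.informit.com/store/product.aspx?isbn=0130211192"],
    "title": ["XML Processing with Python"],
}

# Collapse each single-element list into one cleaned string.
flat = {key: first_clean(values) for key, values in item.items()}
print(flat["title"])
```

In a real project this kind of cleanup would more naturally live in an Item Loader or an item pipeline.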

2.3.4 Following links

Let’s say, instead of just scraping the stuff in the Books and Resources pages, you want everything that is under the Python directory.

Now that you know how to extract data from a page, why not extract the links for the pages you are interested in, follow them, and then extract the data you want from all of them?

Here is a modification to our spider that does just that:

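import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        # Follow every category link found in the directory listing
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # The listing is cut off at this point in the source; this callback is
        # assumed to reuse the extraction logic of the earlier parse() method.
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item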
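The response.urljoin() call matters because href attributes are often relative. Its effect can be approximated with the standard library, assuming it resolves the link against the URL of the page the link was found on. The sketch uses Python 3's urllib.parse, and the href values are hypothetical examples rather than links taken from the actual page.

```python
from urllib.parse import urljoin

# URL of the page on which the links were found.
page_url = "http://www.dmoz.org/Computers/Programming/Languages/Python/"

# Root-relative href: replaces the whole path of the page URL.
root_relative = urljoin(page_url, "/Computers/Programming/Languages/Python/Books/")

# Plain relative href: appended to the directory of the page URL.
relative = urljoin(page_url, "Resources/")

# Absolute href: returned unchanged.
absolute = urljoin(page_url, "http://www.dmoz.org/Computers/")

print(root_relative)
print(relative)
print(absolute)
```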
