Scrapy Basics and Core Concepts Explained: Crawler Development Through the English Documentation


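Scrapy is a powerful, flexible web crawling framework written in Python, used to scrape web page data efficiently. This document is the official Scrapy documentation, release 1.1.1. It gives beginners a complete and clear tutorial and also covers advanced concepts and techniques in depth. Key points from some of its chapters:

1. **Getting help**: where to find help with Scrapy, including the official documentation, community forums and the GitHub repository, so that problems can be resolved quickly.

2. **First steps**:
   - **Scrapy at a glance**: introduces the main components, such as Spider, Selector and ItemLoader, and the role each plays during a crawl.
   - **Installation guide**: explains how to install and configure a Scrapy environment so that new users can start a project smoothly.
   - **Tutorial**: walks through an example project to show how to write a basic spider, including setting start_urls, parsing responses and handling Items.

3. **Basic concepts**:
   - **Command line tool**: how to use the scrapy command-line tool, including the Scrapy shell for interactive debugging and data validation.
   - **Spiders**: how to define and organize Scrapy Spiders, including request management, middlewares and download policy.
   - **Selectors**: XPath and CSS selectors for extracting the required data from HTML documents.
   - **Items and ItemLoaders**: the data model for scraped data, and how ItemLoader simplifies populating it.
   - **Item Pipeline**: how to define an Item Pipeline to clean, store and further process scraped data.
   - **Feed exports**: the available ways to export data, such as CSV, JSON or database storage.
   - **Requests and Responses**: how HTTP requests and responses work and how Scrapy handles them.

4. **Built-in services**:
   - **Logging**: Scrapy's built-in logging system, which helps track and record information while a spider runs.
   - **Stats collection**: statistics that are essential for monitoring spider performance, such as crawl speed and success rate.
   - **Sending e-mail**: how to send e-mail notifications from Scrapy, for example spider status updates or error reports.
   - **Telnet console**: a way to interact with a running crawler in real time.
   - **Web service**: Scrapy supports integration with other systems through a web service API.

5. **Solving specific problems**:
   - **Frequently asked questions**: answers to common questions, such as handling JavaScript-rendered pages and dynamic content.
   - **Debugging spiders**: how to identify and fix problems in spiders, including error handling and log analysis.
   - **Spiders Contracts**: a way to test spiders by attaching contracts (expected behaviour) to their callbacks.
   - **Common practices**: efficient ways to write code and organize projects.
   - **Broad crawls**: how to handle large-scale or deeply nested site structures.
   - **Using the browser**: how to use Firefox and Firebug for finer-grained page analysis.
   - **Debugging memory leaks**: tools and techniques for identifying and optimizing memory usage.
   - **Ubuntu packages**: how to install and manage Scrapy on Ubuntu.
   - **Deployment**: deploying Scrapy to production, plus extension options such as AutoThrottle.
   - **Benchmarking**: how to measure and optimize spider performance.

This original Scrapy documentation gives beginners detailed guidance from first steps to advanced topics, covering Scrapy's core features and solutions to common problems.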

Scrapy Documentation, Release 1.1.1

        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
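To make the file-naming line in parse concrete, here is a minimal sketch using only plain string handling. It assumes the URL ends with a trailing slash, as both start URLs in the tutorial do. The helper name filename_for is hypothetical and is not part of Scrapy.

```python
def filename_for(url):
    """Mimic the tutorial's naming rule: second-to-last piece of the URL + '.html'."""
    # Splitting on "/" leaves an empty string after a trailing slash,
    # so index -2 is the last real path segment.
    return url.split("/")[-2] + ".html"


urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]
filenames = [filename_for(u) for u in urls]
print(filenames)
```

If a URL had no trailing slash, index -2 would pick the parent segment instead (here, "Python"), so this naming rule only suits URLs shaped like the ones above.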

Crawling

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl dmoz

This command runs the spider named dmoz that we’ve just added, which will send some requests to the dmoz.org domain. You will get output similar to this:

2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Spider opened
2014-01-23 18:13:08-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-01-23 18:13:09-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-01-23 18:13:09-0400 [scrapy] INFO: Closing spider (finished)

Note: At the end you can see a log line for each URL defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is shown at the end of the log line, where it says (referer: None).

Now, check the files in the current directory. You should notice two new files have been created: Books.html and Resources.html, with the content for the respective URLs, as our parse method instructs.

What just happened under the hood?

Scrapy creates scrapy.Request objects for each URL in the start_urls attribute of the Spider, and assigns them the parse method of the spider as their callback function.

These Requests are scheduled, then executed, and scrapy.http.Response objects are returned and then fed back to the spider, through the parse() method.
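The request and callback cycle described above can be pictured with a small toy model. This is only a sketch of the idea, not how Scrapy is implemented: every class and function name below (ToyRequest, ToyResponse, ToySpider, fake_download, run) is hypothetical, no network access happens, and the example.com URLs are placeholders.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyRequest:
    url: str
    callback: Callable  # function that will receive the response


@dataclass
class ToyResponse:
    url: str
    body: bytes


class ToySpider:
    start_urls = ["http://example.com/a/", "http://example.com/b/"]

    def __init__(self):
        self.seen = []

    def parse(self, response):
        # Default callback: just record which URL was handled.
        self.seen.append(response.url)


def fake_download(request):
    # Stand-in for the network layer; no real HTTP request is made.
    return ToyResponse(url=request.url, body=b"<html></html>")


def run(spider):
    # 1. One request per start URL, with parse as the callback.
    queue = deque(ToyRequest(url, spider.parse) for url in spider.start_urls)
    # 2. Requests are taken from the schedule and "executed".
    while queue:
        request = queue.popleft()
        response = fake_download(request)
        # 3. The response is fed back to the spider through the callback.
        request.callback(response)


spider = ToySpider()
run(spider)
print(spider.seen)
```

The toy loop is strictly sequential. In the real log above, the Resources page is crawled before the Books page, which is a reminder that Scrapy does not promise to deliver responses in start_urls order.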

Extracting Items

Introduction to Selectors

There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath or CSS expressions called Scrapy Selectors. For more information about selectors and other extraction mechanisms see the Selectors documentation.

Here are some examples of XPath expressions and their meanings:


• /html/head/title: selects the <title> element, inside the <head> element of an HTML document

• /html/head/title/text(): selects the text inside the aforementioned <title> element.

• //td: selects all the <td> elements

• //div[@class="mine"]: selects all div elements which contain an attribute class="mine"
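The four expressions above can be tried without Scrapy by using the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath. This is a rough sketch under those limits: the sample page is hypothetical, paths are written relative to the root element rather than starting with /html, and text() is replaced by the .text attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree needs valid XML).
PAGE = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div class="mine">first</div>
    <div class="other">second</div>
    <table><tr><td>a</td><td>b</td></tr></table>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# /html/head/title  -> written relative to the root element here
title = root.find("head/title")

# /html/head/title/text()  -> ElementTree exposes the text via .text
title_text = title.text

# //td  -> ".//td" searches all descendants of the root
cells = [td.text for td in root.findall(".//td")]

# //div[@class="mine"]
mine = [div.text for div in root.findall(".//div[@class='mine']")]

print(title_text, cells, mine)
```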

These are just a couple of simple examples of what you can do with XPath, but XPath expressions are indeed much more powerful. To learn more about XPath, we recommend this tutorial to learn XPath through examples, and this tutorial to learn “how to think in XPath”.

Note: CSS vs XPath: you can go a long way extracting data from web pages using only CSS selectors. However, XPath offers more power because besides navigating the structure, it can also look at the content: you’re able to select things like: the link that contains the text ‘Next Page’. Because of this, we encourage you to learn about XPath even if you already know how to construct CSS selectors.
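As an illustration of selecting by content, full XPath allows an expression such as //a[contains(text(), "Next Page")]. The sketch below uses the standard library's ElementTree instead, which has no contains() and offers only an exact-match text predicate (Python 3.7 or later). The markup and href values are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination snippet.
NAV = """
<div>
  <a href="/page/1">Previous Page</a>
  <a href="/page/3">Next Page</a>
</div>
"""

root = ET.fromstring(NAV)

# [.='Next Page'] matches on the element's text content,
# not on its position, tag attributes or class.
next_links = root.findall(".//a[.='Next Page']")
next_href = next_links[0].get("href")
print(next_href)
```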

For working with CSS and XPath expressions, Scrapy provides the Selector class and convenient shortcuts to avoid instantiating selectors yourself every time you need to select something from a response.

You can see selectors as objects that represent nodes in the document structure. So, the first instantiated selectors are associated with the root node, or the entire document.

Selectors have four basic methods:

• xpath(): returns a list of selectors, each of which represents the nodes selected by the xpath expression given as argument.

• css(): returns a list of selectors, each of which represents the nodes selected by the CSS expression given as argument.

• extract(): returns a unicode string with the selected data.

• re(): returns a list of unicode strings extracted by applying the regular expression given as argument.

Trying Selectors in the Shell

To illustrate the use of Selectors we’re going to use the built-in Scrapy shell, which also requires IPython (an extended Python console) installed on your system.

To start a shell, you must go to the project’s top level directory and run:

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

Note: Remember to always enclose URLs in quotes when running the Scrapy shell from the command line; otherwise URLs containing arguments (i.e. the & character) will not work.

This is what the shell looks like:

[ ... Scrapy log here ... ]

2014-01-23 17:11:42-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x3636b50>
[s]   item       {}
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   settings   <scrapy.settings.Settings object at 0x3fadc50>
[s]   spider     <Spider 'default' at 0x3cebf50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]:

After the shell loads, you will have the response fetched in a local response variable, so if you type response.body you will see the body of the response, or you can type response.headers to see its headers.

More importantly, response has a selector attribute which is an instance of the Selector class, instantiated with this particular response. You can run queries on response by calling response.selector.xpath() or response.selector.css(). There are also some convenience shortcuts like response.xpath() or response.css() which map directly to response.selector.xpath() and response.selector.css().

So let’s try it:

In [1]: response.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

In [2]: response.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: response.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]

In [4]: response.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: response.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
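To see why the last call returns those four words, the same pattern can be applied to the extracted title text with the standard library's re module. This is only an approximation of what the selector's re() method does, shown on the title string from the session above.

```python
import re

# Title text as returned by .extract() in the shell session.
title = "Open Directory - Computers: Programming: Languages: Python: Books"

# (\w+): captures each run of word characters that is directly followed by a colon.
words_before_colon = re.findall(r"(\w+):", title)
print(words_before_colon)
```

"Open", "Directory" and "Books" are left out because none of them is immediately followed by a colon.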

Extracting the data

Now, let’s try to extract some real information from those pages.

You could type response.body in the console, and inspect the source code to figure out the XPaths you need to use. However, inspecting the raw HTML code there could become a very tedious task. To make it easier, you can use Firefox Developer Tools or some Firefox extensions like Firebug. For more information see Using Firebug for scraping and Using Firefox for scraping.

After inspecting the page source, you’ll find that the web site’s information is inside a <ul> element, in fact the second <ul> element.

So we can select each <li> element belonging to the site’s list with this code:

response.xpath('//ul/li')

And from them, the site’s descriptions:

response.xpath('//ul/li/text()').extract()

The site’s titles:


import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # One item per <li>; each XPath below is relative to that <li>.
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
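The per-item extraction in parse can be imitated on a small snippet with the standard library, which may help when reasoning about what each relative XPath picks up. This is a sketch only: the markup is a simplified, hypothetical stand-in for the real page, the second link is a placeholder, and ElementTree is used in place of Scrapy's selectors. Unlike the spider, which stores the lists returned by extract(), it stores single strings.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is larger and messier.
PAGE = """
<html><body>
  <ul>
    <li><a href="http://gnosis.cx/TPiP/">Text Processing in Python</a> - By David Mertz</li>
    <li><a href="http://example.com/xml">XML Processing with Python</a> - By Sean McGrath</li>
  </ul>
</body></html>
"""

root = ET.fromstring(PAGE)
items = []
for li in root.findall(".//ul/li"):
    link = li.find("a")
    items.append({
        "title": link.text,        # counterpart of a/text()
        "link": link.get("href"),  # counterpart of a/@href
        # Text that follows </a> inside the <li> is stored as the tail of <a>.
        "desc": (link.tail or "").strip(),
    })

for item in items:
    print(item)
```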

Note: You can find a fully-functional variant of this spider in the dirbot project available at https://github.com/scrapy/dirbot

Now crawling dmoz.org yields DmozItem objects:

[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n'],
     'link': [u'http://gnosis.cx/TPiP/'],
     'title': [u'Text Processing in Python']}
[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
     'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
     'title': [u'XML Processing with Python']}

2.3.4 Following links

Let’s say, instead of just scraping the stuff in the Books and Resources pages, you want everything that is under the Python directory.

Now that you know how to extract data from a page, why not extract the links for the pages you are interested in, follow them and then extract the data you want from all of them?

Here is a modification to our spider that does just that:

import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)
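The response.urljoin() call turns a possibly relative href into an absolute URL based on the response's own URL. The same kind of resolution can be sketched with the standard library's urllib.parse.urljoin. The href values below are hypothetical examples of the forms a page might contain, not values taken from dmoz.org.

```python
from urllib.parse import urljoin

page_url = "http://www.dmoz.org/Computers/Programming/Languages/Python/"

# Hypothetical href values of the kinds a page might contain.
hrefs = [
    "/Computers/Programming/Languages/Python/Books/",  # root-relative
    "Resources/",                                      # relative to the page
    "http://www.dmoz.org/Computers/",                  # already absolute
]

absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)
```

Resolving links before building requests matters because a request needs a complete URL, while pages often contain only relative ones.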
