Scrapy Tutorial: From Basics to Practice

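"A Scrapy tutorial covering Scrapy's basic concepts, installation guide, spiders, selectors, items, item pipelines, link extractors, built-in services, solutions to specific problems, and more."

Scrapy is a powerful Python crawling framework, widely used for data scraping and web page parsing. This tutorial covers the key topics needed by users from beginner to advanced level.

1. **First steps**
   - Scrapy at a glance: gives an overview of the Scrapy framework so that users can quickly understand its core components and workflow.
   - Installation guide: explains how to install Scrapy on different operating systems and configure the environment correctly.

2. **Basic concepts**
   - Command line tool: introduces the Scrapy command line tool, for example creating a new project and running a spider.
   - Spiders: explains spiders, the core component that performs the crawling.
   - Selectors: covers XPath and CSS selectors, used to extract data from HTML or XML documents.
   - Items: describes how to define the data structures that hold scraped data.
   - Item Loaders: introduces Item Loaders, a more convenient way to populate Items.
   - Scrapy shell: an interactive tool for quickly testing and debugging selectors.
   - Item Pipeline: explains the data processing pipeline, including cleaning, validating, and storing data.
   - Feed exports: shows how to export scraped data to formats such as CSV and JSON.
   - Requests and Responses: discusses the request and response objects and how to work with them.
   - Link Extractors: extract links from web pages and control which pages the spider follows.
   - Settings: introduces the project settings, which customize the behavior of the framework.
   - Exceptions: discusses the exceptions Scrapy can raise and how to handle them.

3. **Built-in services**
   - Logging: explains Scrapy's logging system, which helps developers trace and debug problems.
   - Stats Collection: describes the stats collector, which gathers metrics while a spider runs.
   - Sending email: covers sending email from Scrapy, for example reports or alerts.
   - Telnet Console: introduces connecting over telnet to Scrapy's built-in console for live debugging.
   - Web Service: explains how to enable and use Scrapy's web API for remote monitoring.

4. **Solving specific problems**
   - Frequently Asked Questions: collects common questions and their answers.
   - Debugging Spiders: provides techniques for debugging spiders.
   - Spiders Contracts: introduces spider contracts, a way to test that spider callbacks behave as expected.
   - Common Practices: shares practices that help in writing efficient and reliable spiders.
   - Broad Crawls: discusses tuning Scrapy for crawls that cover a large number of domains.
   - Using Firefox for scraping: shows how to use the Firefox browser when scraping.
   - Using Firebug for scraping: introduces the Firebug extension as an aid to spider development.
   - Debugging memory leaks: explains how to detect and fix memory leaks.
   - Downloading and processing files and images: explains how to download and process files and images found in web pages.
   - Ubuntu packages: provides packaging information for installing and managing Scrapy on Ubuntu.
   - Deploying Spiders: covers deploying spiders so that they run in production.
   - AutoThrottle extension: introduces the automatic throttling extension, which adjusts the request rate dynamically.
   - Benchmarking: explains how to benchmark Scrapy's performance.
   - Jobs: covers job-related features, namely pausing and resuming crawls.

With this Scrapy tutorial you can learn the full range of skills from building a spider to tuning its performance, whether you are a data analyst, a web developer, or a researcher.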

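Scrapy Documentation, Release 1.1.0

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Name the file after the last directory in the URL, e.g. "Books.html"
        filename = response.url.split("/")[-2] + '.html'
        # response.body is bytes, so the file is opened in binary mode
        with open(filename, 'wb') as f:
            f.write(response.body)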

Crawling

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl dmoz

This command runs the spider named dmoz that we’ve just added, which sends some requests to the dmoz.org domain. You will get output similar to this:

2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Spider opened
2014-01-23 18:13:08-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-01-23 18:13:09-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-01-23 18:13:09-0400 [scrapy] INFO: Closing spider (finished)

Note: At the end you can see a log line for each URL defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is shown at the end of the log line, where it says (referer: None).

Now, check the files in the current directory. You should notice two new files have been created: Books.html and Resources.html, with the content for the respective URLs, as our parse method instructs.
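The file names come from the second-to-last segment of each start URL. A minimal sketch of that expression on plain strings, with no Scrapy involved (the helper name page_filename is hypothetical and used only for illustration):

```python
def page_filename(url):
    # The URLs end with "/", so split("/") leaves an empty last element
    # and the directory name sits at index -2.
    return url.split("/")[-2] + ".html"


start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

filenames = [page_filename(url) for url in start_urls]
print(filenames)
```

Note that a URL without a trailing slash would give a different segment at index -2, so this naming scheme only suits URLs shaped like the two above.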

What just happened under the hood?

Scrapy creates scrapy.Request objects for each URL in the start_urls attribute of the Spider, and assigns them the parse method of the spider as their callback function.

These Requests are scheduled, then executed, and scrapy.http.Response objects are returned and then fed back to the spider, through the parse() method.
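The flow described above can be pictured with a toy model in plain Python. This is not Scrapy's implementation: ToyRequest, ToyResponse, ToySpider, run and fake_fetch are all hypothetical names, the "download" is faked, and Scrapy's real engine is asynchronous. The sketch only shows how a callback attached to a request ends up receiving the response.

```python
from collections import deque


class ToyRequest:
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback  # function to call with the response


class ToyResponse:
    def __init__(self, url, body):
        self.url = url
        self.body = body


class ToySpider:
    start_urls = ["http://example.com/a/", "http://example.com/b/"]

    def __init__(self):
        self.parsed = []

    def start_requests(self):
        # One request per start URL, each with parse() as its callback.
        for url in self.start_urls:
            yield ToyRequest(url, callback=self.parse)

    def parse(self, response):
        self.parsed.append((response.url, len(response.body)))


def fake_fetch(url):
    # Stand-in for the downloader: returns fixed bytes instead of a page.
    return b"<html></html>"


def run(spider, fetch):
    queue = deque(spider.start_requests())  # "scheduling"
    while queue:
        request = queue.popleft()
        response = ToyResponse(request.url, fetch(request.url))  # "executing"
        request.callback(response)  # response fed back to the spider


spider = ToySpider()
run(spider, fake_fetch)
print(spider.parsed)
```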

Extracting Items

Introduction to Selectors

There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath or CSS expressions called Scrapy Selectors. For more information about selectors and other extraction mechanisms see the Selectors documentation.

Here are some examples of XPath expressions and their meanings:


• /html/head/title: selects the <title> element, inside the <head> element of an HTML document

• /html/head/title/text(): selects the text inside the aforementioned <title> element.

• //td: selects all the <td> elements

• //div[@class="mine"]: selects all div elements which contain an attribute class="mine"

These are just a couple of simple examples of what you can do with XPath, but XPath expressions are indeed much more powerful. To learn more about XPath, we recommend this tutorial to learn XPath through examples, and this tutorial to learn “how to think in XPath”.
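The expressions listed above can be tried without Scrapy using the standard library's xml.etree.ElementTree, which supports only a subset of XPath. In this sketch the sample document is invented, paths are written relative to the root <html> element (ElementTree does not accept absolute paths on an element), and the text is read through the .text attribute because text() is not part of the supported subset.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample document (ElementTree needs valid XML).
DOC = """<html>
  <head><title>Sample page</title></head>
  <body>
    <div class="mine">first</div>
    <div class="other">second</div>
    <table><tr><td>a</td><td>b</td></tr></table>
  </body>
</html>"""

root = ET.fromstring(DOC)  # root is the <html> element

# Counterpart of /html/head/title, relative to <html>
title = root.find("head/title")

# Counterpart of /html/head/title/text()
title_text = title.text

# Counterpart of //td
cells = root.findall(".//td")

# Counterpart of //div[@class="mine"]
mine = root.findall(".//div[@class='mine']")

print(title_text, [c.text for c in cells], [d.text for d in mine])
```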

Note: CSS vs XPath: you can go a long way extracting data from web pages using only CSS selectors. However, XPath offers more power because besides navigating the structure, it can also look at the content: you’re able to select things like: the link that contains the text ‘Next Page’. Because of this, we encourage you to learn about XPath even if you already know how to construct CSS selectors.
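As a small illustration of selecting by content, the sketch below uses the standard library's ElementTree rather than Scrapy. Its XPath subset offers the predicate [.='text'] (Python 3.7 and later), which matches elements whose full text equals the given string. The markup is invented for the example. In full XPath, as available through Scrapy selectors, a substring match could be written as //a[contains(text(), "Next Page")].

```python
import xml.etree.ElementTree as ET

# Invented pagination markup for the example.
NAV = """<div>
  <a href="/page/1">Previous Page</a>
  <a href="/page/3">Next Page</a>
</div>"""

nav = ET.fromstring(NAV)

# Select the link by its text content rather than by its position.
next_links = nav.findall(".//a[.='Next Page']")
next_href = next_links[0].get("href")
print(next_href)
```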

For working with CSS and XPath expressions, Scrapy provides the Selector class and convenient shortcuts to avoid instantiating selectors yourself every time you need to select something from a response.

You can see selectors as objects that represent nodes in the document structure. So, the first instantiated selectors are associated with the root node, or the entire document.

Selectors have four basic methods:

• xpath(): returns a list of selectors, each of which represents the nodes selected by the XPath expression given as argument.

• css(): returns a list of selectors, each of which represents the nodes selected by the CSS expression given as argument.

• extract(): returns a unicode string with the selected data.

• re(): returns a list of unicode strings extracted by applying the regular expression given as argument.

Trying Selectors in the Shell

To illustrate the use of Selectors we’re going to use the built-in Scrapy shell, which also requires IPython (an extended Python console) installed on your system.

To start a shell, you must go to the project’s top level directory and run:

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

Note: Remember to always enclose URLs in quotes when running Scrapy shell from the command line, otherwise URLs containing arguments (i.e. the & character) will not work.
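The reason is that a shell treats an unquoted & as a control operator, so the URL is cut short before Scrapy ever sees it. The sketch below approximates this with the standard library's shlex module, which mimics shell tokenization but is not an actual shell. The example URL and the helper name shell_tokens are made up for illustration.

```python
import shlex


def shell_tokens(command):
    # punctuation_chars=True makes shlex treat characters such as & as
    # separate tokens; whitespace_split keeps the rest of each word whole.
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    return list(lexer)


unquoted = shell_tokens('scrapy shell http://example.com/search?q=python&page=2')
quoted = shell_tokens('scrapy shell "http://example.com/search?q=python&page=2"')

print(unquoted)  # the URL is split at the & character
print(quoted)    # the URL survives as a single argument
```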

This is what the shell looks like:

[ ... Scrapy log here ... ]

2014-01-23 17:11:42-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x3636b50>
[s]   item       {}
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   settings   <scrapy.settings.Settings object at 0x3fadc50>
[s]   spider     <Spider 'default' at 0x3cebf50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]:

After the shell loads, you will have the response fetched in a local response variable, so if you type response.body you will see the body of the response, or you can type response.headers to see its headers.

More importantly, response has a selector attribute which is an instance of the Selector class, instantiated with this particular response. You can run queries on response by calling response.selector.xpath() or response.selector.css(). There are also some convenience shortcuts like response.xpath() or response.css() which map directly to response.selector.xpath() and response.selector.css().

So let’s try it:

In [1]: response.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

In [2]: response.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: response.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]

In [4]: response.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: response.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
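The last result can be reproduced with the standard library's re module applied to the title string from Out[4]. This is only a sketch of what the pattern matches, not of how Selector.re() is implemented: each word immediately followed by a colon is captured, so "Open", "Directory" and "Books" are left out.

```python
import re

# The title text shown in Out[4] above.
title = "Open Directory - Computers: Programming: Languages: Python: Books"

# (\w+): captures every run of word characters directly followed by ":".
words = re.findall(r"(\w+):", title)
print(words)
```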

Extracting the data

Now, let’s try to extract some real information from those pages.

You could type response.body in the console, and inspect the source code to figure out the XPaths you need to use. However, inspecting the raw HTML code there could become a very tedious task. To make it easier, you can use Firefox Developer Tools or some Firefox extensions like Firebug. For more information see Using Firebug for scraping and Using Firefox for scraping.

After inspecting the page source, you’ll find that the web site’s information is inside a <ul> element, in fact the second <ul> element.

So we can select each <li> element belonging to the site’s list with this code:

response.xpath('//ul/li')

And from them, the site’s descriptions:

response.xpath('//ul/li/text()').extract()

The site’s titles:


import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # One item per <li>; the inner XPaths are relative to that <li>
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

Note: You can find a fully-functional variant of this spider in the dirbot project available at https://github.com/scrapy/dirbot

Now crawling dmoz.org yields DmozItem objects:

[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n'],
     'link': [u'http://gnosis.cx/TPiP/'],
     'title': [u'Text Processing in Python']}
[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
     'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
     'title': [u'XML Processing with Python']}

2.3.4 Following links

Let’s say, instead of just scraping the stuff in Books and Resources pages, you want everything that is under the Python directory.

Now that you know how to extract data from a page, why not extract the links for the pages you are interested in, follow them and then extract the data you want for all of them?

Here is a modification to our spider that does just that:

import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        # Follow each directory link found in the listing
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            # Build an absolute URL from the (possibly relative) href
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)
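In the spider above, response.urljoin() turns a possibly relative href into an absolute URL based on the page it was found on. The standard library's urllib.parse.urljoin performs the same kind of resolution, so it can be used to see the effect without running a crawl. The href values below are made-up examples, not links taken from the dmoz.org page.

```python
from urllib.parse import urljoin

# URL of the page on which the links were found.
base = "http://www.dmoz.org/Computers/Programming/Languages/Python/"

# Hypothetical href values in three common shapes.
relative = urljoin(base, "Books/")
root_relative = urljoin(base, "/Computers/Programming/Languages/Python/Resources/")
absolute = urljoin(base, "http://example.com/other/")

print(relative)
print(root_relative)
print(absolute)
```

A relative href is appended to the base directory, a root-relative href replaces the whole path, and an absolute URL is returned unchanged.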
