Scrapy Tutorial: From Beginner to Mastery

Updated 2024-07-22. PDF, 958 KB.

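"Scrapy Tutorial"

Scrapy is a powerful Python crawling framework for efficiently scraping web page data and handling network requests. This tutorial aims to help users quickly understand and master Scrapy’s architecture, usage, and core concepts. Scrapy’s basic building blocks include the command line tool, the project structure, Items, Spiders, Selectors, Item Loaders, the Item Pipeline, the Scrapy shell, Link Extractors, and a set of built-in services. At the introductory stage you need an overview of Scrapy as a whole, including how to install it and set up an environment, so that you can start your first Scrapy project.

One of Scrapy’s core concepts is the command line tool, which provides a set of commands for creating projects, starting spiders, viewing logs, and so on. Items define the structure of the data to be scraped; they behave like dictionaries, which makes the data easy to process and store. Spiders are the central component of Scrapy: they define the crawling rules and how page content is parsed. Selectors are based on XPath or CSS expressions and extract data from HTML or XML documents. Item Loaders populate Items and can be combined with Selectors to simplify data processing.

The Item Pipeline is an important part of how Scrapy handles the data flow: it lets you clean, validate, and transform data before it is stored. Feed exports write the scraped results to various formats (such as JSON and CSV). Link Extractors automatically identify and manage the links in a page, which helps with automatic crawling.

Scrapy also provides several built-in services. The logging system helps you debug and monitor a running crawler; Stats Collection gathers statistics about the run; the e-mail facility can notify users when specific events occur; the Telnet Console and the Web Service provide an interactive console and a remote interface for monitoring and adjusting crawler behavior.

For solving specific problems, Scrapy provides a FAQ, methods for debugging spiders, Spiders Contracts (to ensure that spider behavior stays consistent), best practices, strategies for large-scale crawls, web page debugging with Firefox and Firebug, and memory leak detection. In addition, Scrapy supports downloading images from pages, offers Ubuntu packages for installation, and provides the Scrapyd service for deploying and scheduling spiders. The AutoThrottle extension adjusts the request rate dynamically to avoid putting excessive load on the target site. The Jobs feature allows crawls to be paused and resumed, and DjangoItem allows integration with the Django framework.

Finally, Scrapy’s extensibility lets you customize its behavior through middlewares, downloader extensions, spider extensions, and similar mechanisms, to fit a wide range of complex crawling needs. With a solid understanding of these core concepts and features, you will be able to make full use of Scrapy and build efficient, flexible web crawlers.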

Scrapy Documentation, Release 0.24.4

Crawling

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl dmoz

The crawl dmoz command runs the spider for the dmoz.org domain. You will get an output similar to this:

2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2014-01-23 18:13:07-0400 [dmoz] INFO: Spider opened
2014-01-23 18:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-01-23 18:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-01-23 18:13:09-0400 [dmoz] INFO: Closing spider (finished)

Pay attention to the lines containing [dmoz], which correspond to our spider. You can see a log line for each URL defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is shown at the end of the log line, where it says (referer: None).
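As a small illustration of how to read those lines, the sketch below pulls the spider name, HTTP status, URL, and referer out of one of the "Crawled" lines shown above. The regular expression and the helper name parse_crawled are assumptions made for this example; they match only this one line shape and are not a general Scrapy log parser.

```python
import re

# One of the "Crawled" lines from the log output above.
LINE = ("2014-01-23 18:13:08-0400 [dmoz] DEBUG: Crawled (200) "
        "<GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> "
        "(referer: None)")

# Hypothetical pattern written for this line shape only.
CRAWLED = re.compile(
    r"\[(?P<spider>[^\]]+)\] DEBUG: Crawled \((?P<status>\d+)\) "
    r"<(?P<method>\w+) (?P<url>\S+)> \(referer: (?P<referer>[^)]+)\)"
)


def parse_crawled(line):
    """Return the fields of a 'Crawled' log line as a dict, or None if no match."""
    match = CRAWLED.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # The log prints the literal text "None" for start URLs.
    if fields["referer"] == "None":
        fields["referer"] = None
    return fields


info = parse_crawled(LINE)
print(info["spider"], info["status"], info["url"], info["referer"])
```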

More interestingly, as our parse method instructs, two files have been created: Books and Resources, each with the content of its URL.

What just happened under the hood?

Scrapy creates scrapy.Request objects for each URL in the start_urls attribute of the Spider, and assigns them the parse method of the spider as their callback function.

These Requests are scheduled, then executed, and scrapy.http.Response objects are returned and then fed back to the spider, through the parse() method.
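The flow described above can be pictured with a toy model in plain Python. Everything in it (ToyRequest, ToyResponse, ToySpider, run, and the fake_download function) is hypothetical and exists only for illustration; it performs no network access and is not how Scrapy itself is implemented.

```python
from collections import deque


class ToyRequest:
    """Hypothetical stand-in for a request: a URL plus the callback to run."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class ToyResponse:
    """Hypothetical stand-in for a response; nothing is downloaded here."""

    def __init__(self, url, body):
        self.url = url
        self.body = body


class ToySpider:
    start_urls = ["http://example.com/Books/", "http://example.com/Resources/"]

    def __init__(self):
        self.seen = []

    def parse(self, response):
        # Record which page was handed back to the spider.
        self.seen.append(response.url)


def run(spider, fake_download):
    # 1. One request per start URL, with spider.parse as the callback.
    queue = deque(ToyRequest(url, spider.parse) for url in spider.start_urls)
    # 2. Requests are taken from the schedule and "executed".
    while queue:
        request = queue.popleft()
        response = ToyResponse(request.url, fake_download(request.url))
        # 3. The response is fed back to the spider through the callback.
        request.callback(response)


spider = ToySpider()
run(spider, fake_download=lambda url: "<html>%s</html>" % url)
print(spider.seen)
```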

Extracting Items

Introduction to Selectors

There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath or CSS expressions called Scrapy Selectors. For more information about selectors and other extraction mechanisms see the Selectors documentation.

Here are some examples of XPath expressions and their meanings:

• /html/head/title: selects the <title> element, inside the <head> element of an HTML document
• /html/head/title/text(): selects the text inside the aforementioned <title> element
• //td: selects all the <td> elements
• //div[@class="mine"]: selects all div elements which contain an attribute class="mine"

These are just a couple of simple examples of what you can do with XPath, but XPath expressions are indeed much more powerful. To learn more about XPath we recommend this XPath tutorial.
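To try queries of this kind without Scrapy installed, the sketch below uses Python’s standard xml.etree.ElementTree module on a small made-up document. ElementTree supports only a subset of XPath (no absolute paths from the document root and no text() step), so each query is an approximate counterpart of the expression named in its comment rather than the same expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed document used only for this illustration.
DOC = """<html>
  <head><title>Example page</title></head>
  <body>
    <div class="mine"><table><tr><td>a</td><td>b</td></tr></table></div>
    <div class="other">x</div>
  </body>
</html>"""

root = ET.fromstring(DOC)  # root is the <html> element

title = root.find("./head/title")             # roughly /html/head/title
title_text = title.text                       # roughly /html/head/title/text()
cells = root.findall(".//td")                 # roughly //td
mine = root.findall(".//div[@class='mine']")  # roughly //div[@class="mine"]

print(title_text, [cell.text for cell in cells], len(mine))
```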

For working with XPaths, Scrapy provides the Selector class and convenient shortcuts to avoid instantiating selectors yourself every time you need to select something from a response.

You can see selectors as objects that represent nodes in the document structure. So, the first instantiated selectors are associated with the root node, or the entire document.

Selectors have four basic methods:

• xpath(): returns a list of selectors, each of them representing the nodes selected by the xpath expression given as argument.
• css(): returns a list of selectors, each of them representing the nodes selected by the CSS expression given as argument.
• extract(): returns a unicode string with the selected data (called on a list of selectors, it returns a list of such strings, as in the shell session below).
• re(): returns a list of unicode strings extracted by applying the regular expression given as argument.

Trying Selectors in the Shell

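To illustrate the use of Selectors we’re going to use the built-in Scrapy shell. The sessions below were recorded with IPython (an extended Python console) installed on the system; without it, the shell falls back to the standard Python console and the prompts look different.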

To start a shell, you must go to the project’s top level directory and run:

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

Note: Remember to always enclose URLs in quotes when running the Scrapy shell from the command line, otherwise URLs containing arguments (i.e. the & character) will not work.
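If you build such a command line from a script, the standard shlex module can add the quoting for you. The sketch below is only an illustration for POSIX shells; the URL is a made-up example, and shlex.quote uses single quotes where the command above uses double quotes.

```python
import shlex

# Hypothetical URL with query arguments; the & would otherwise be
# interpreted by a POSIX shell.
url = "http://www.example.com/search?q=python&page=2"

command = "scrapy shell " + shlex.quote(url)
print(command)

# Splitting the command the way a POSIX shell would keeps the URL in one piece.
parts = shlex.split(command)
print(parts)
```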

This is what the shell looks like:

[ ... Scrapy log here ... ]

2014-01-23 17:11:42-0400 [default] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x3636b50>
[s]   item       {}
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   settings   <scrapy.settings.Settings object at 0x3fadc50>
[s]   spider     <Spider 'default' at 0x3cebf50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]:

After the shell loads, you will have the response fetched in a local response variable, so if you type response.body you will see the body of the response, or you can type response.headers to see its headers.

More importantly, if you type response.selector you will access a selector object you can use to query the response, along with convenient shortcuts like response.xpath() and response.css(), which map to response.selector.xpath() and response.selector.css().

So let’s try it:

In [1]: response.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

In [2]: response.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: response.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]

In [4]: response.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: response.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
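The last call in the session applies a regular expression to the selected title text. The sketch below runs the same pattern with the standard re module on the title string shown in the session, to make clear why only the four words followed by a colon are returned. It illustrates the pattern only and makes no claim about how the re() method is implemented.

```python
import re

# Title text as shown in the shell session above.
title = "Open Directory - Computers: Programming: Languages: Python: Books"

# (\w+): captures each run of word characters that is directly followed by a colon.
words = re.findall(r"(\w+):", title)
print(words)
```

"Open", "Directory", and "Books" are left out because none of them is immediately followed by a colon.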

Extracting the data

Now, let’s try to extract some real information from those pages.

You could type response.body in the console, and inspect the source code to figure out the XPaths you need to use. However, inspecting the raw HTML code there could become a very tedious task. To make this an easier task, you can use some Firefox extensions like Firebug. For more information see Using Firebug for scraping and Using Firefox for scraping.

After inspecting the page source, you’ll find that the web sites’ information is inside a <ul> element, in fact the second <ul> element.

So we can select each <li> element belonging to the sites list with this code:

response.xpath('//ul/li')

And from them, the sites’ descriptions:

response.xpath('//ul/li/text()').extract()

The sites’ titles:

response.xpath('//ul/li/a/text()').extract()

And the sites’ links:

response.xpath('//ul/li/a/@href').extract()

As we’ve said before, each .xpath() call returns a list of selectors, so we can concatenate further .xpath() calls to dig deeper into a node. We are going to use that property here, so:

for sel in response.xpath('//ul/li'):
    title = sel.xpath('a/text()').extract()
    link = sel.xpath('a/@href').extract()
    desc = sel.xpath('text()').extract()
    print title, link, desc

Note: For a more detailed description of using nested selectors, see Nesting selectors and Working with relative XPaths in the Selectors documentation.

Let’s add this code to our spider:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc

Now try crawling the dmoz.org domain again and you’ll see sites being printed in your output. Run:

scrapy crawl dmoz
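The relative queries used inside parse() (a/text(), a/@href, and text()) are evaluated against each <li> node rather than the whole page. The sketch below mimics that idea with the standard xml.etree.ElementTree module on made-up markup; the page content is hypothetical, and ElementTree’s attribute and tail access only approximate the XPath expressions named in the comments.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like the list described above; not the real page.
PAGE = """<html><body>
<ul>
  <li><a href="http://example.com/a">Book A</a> - First description.</li>
  <li><a href="http://example.com/b">Book B</a> - Second description.</li>
</ul>
</body></html>"""

root = ET.fromstring(PAGE)

sites = []
for li in root.findall(".//ul/li"):
    link = li.find("a")  # looked up relative to this <li>, like 'a'
    sites.append({
        "title": link.text,                 # like 'a/text()'
        "link": link.get("href"),           # like 'a/@href'
        "desc": (link.tail or "").strip(),  # text after the <a>, part of 'text()'
    })

for site in sites:
    print(site["title"], site["link"], site["desc"])
```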

Using our item

Item objects are custom Python dicts; you can access the values of their fields (attributes of the class we defined earlier) using the standard dict syntax like:

>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'
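One practical difference from a plain dict is that an Item accepts only the fields declared on its class; assigning an undeclared field raises a KeyError. The sketch below imitates that behaviour with a hypothetical dict subclass named ToyItem. It is a simplified stand-in written for this example, not Scrapy’s Item implementation.

```python
class ToyItem(dict):
    """Hypothetical dict subclass that only accepts declared field names."""

    fields = ("title", "link", "desc")

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s does not support field: %s"
                           % (type(self).__name__, key))
        dict.__setitem__(self, key, value)


item = ToyItem()
item["title"] = "Example title"  # declared field: accepted

try:
    item["author"] = "nobody"    # undeclared field: rejected
    rejected = False
except KeyError:
    rejected = True

print(item, rejected)
```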

Spiders are expected to return their scraped data inside Item objects. So, in order to return the data we’ve scraped so far, the final code for our Spider would be like this:

import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

Note: You can find a fully-functional variant of this spider in the dirbot project available at https://github.com/scrapy/dirbot

Now doing a crawl on the dmoz.org domain yields DmozItem objects:

[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n'],
 'link': [u'http://gnosis.cx/TPiP/'],
 'title': [u'Text Processing in Python']}
[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
 'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
 'title': [u'XML Processing with Python']}
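Every field in the output above is a list, because extract() is called on a list of selectors, and the descriptions carry leading and trailing whitespace. One possible way to flatten such an item is sketched below; the helper name first_clean is an assumption made for this example, and the desc value is a shortened copy of the one shown above.

```python
# Item shaped like the second one in the output above (desc shortened).
item = {
    "desc": [" - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM.\n"],
    "link": ["http://www.informit.com/store/product.aspx?isbn=0130211192"],
    "title": ["XML Processing with Python"],
}


def first_clean(values):
    """Return the first string in values, stripped, or None for an empty list."""
    for value in values:
        return value.strip()
    return None


flat = {key: first_clean(values) for key, values in item.items()}
print(flat["title"])
print(flat["link"])
print(flat["desc"])
```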

2.3.4 Storing the scraped data

The simplest way to store the scraped data is by using the Feed exports, with the following command:

scrapy crawl dmoz -o items.json

That will generate an items.json file containing all scraped items, serialized in JSON.
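The exported file can then be read back with the standard json module. The sketch below assumes the file holds a JSON array with one object per item; to stay self-contained it first writes a two-item sample (titles and links taken from the output above) to a temporary directory instead of reading a real crawl result.

```python
import json
import os
import tempfile

# Sample data standing in for a real export (an assumption for this sketch).
sample = [
    {"title": ["Text Processing in Python"],
     "link": ["http://gnosis.cx/TPiP/"]},
    {"title": ["XML Processing with Python"],
     "link": ["http://www.informit.com/store/product.aspx?isbn=0130211192"]},
]

path = os.path.join(tempfile.mkdtemp(), "items.json")
with open(path, "w") as handle:
    json.dump(sample, handle)

# Reading the file back: a list of dicts, one per scraped item.
with open(path) as handle:
    items = json.load(handle)

titles = [item["title"][0] for item in items]
print(titles)
```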

In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. As with Items, a placeholder file for Item Pipelines has been set up for you when the project is created, in tutorial/pipelines.py. You don’t need to implement any item pipelines, though, if you just want to store the scraped items.
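As a rough idea of what such a pipeline can look like: a pipeline is a plain Python class with a process_item(self, item, spider) method that returns the item. The class below is a hypothetical example (the name StripDescPipeline and its cleaning rule are made up for this sketch), and it is exercised here with a plain dict instead of a real item so that it runs without Scrapy. In a real project the class would also have to be listed in the ITEM_PIPELINES setting to take effect.

```python
class StripDescPipeline(object):
    """Hypothetical pipeline that strips whitespace from each 'desc' string."""

    def process_item(self, item, spider):
        item["desc"] = [text.strip() for text in item.get("desc", [])]
        # Returning the item passes it on to the next pipeline stage.
        return item


# Exercise the pipeline by hand with a dict standing in for a scraped item.
pipeline = StripDescPipeline()
cleaned = pipeline.process_item(
    {"title": ["XML Processing with Python"], "desc": [" - By Sean McGrath \n"]},
    spider=None,
)
print(cleaned["desc"])
```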

2.3.5 Next steps

This tutorial covers only the basics of Scrapy, but there are many other features not mentioned here. Check the What else? section in the Scrapy at a glance chapter for a quick overview of the most important ones.

Then, we recommend you continue by playing with an example project (see Examples), and then continue with the section Basic concepts.

2.4 Examples

The best way to learn is with examples, and Scrapy is no exception. For this reason, there is an example Scrapy project named dirbot, that you can use to play and learn more about Scrapy. It contains the dmoz spider described in the tutorial.

This dirbot project is available at: https://github.com/scrapy/dirbot

It contains a README file with a detailed description of the project contents.

If you’re familiar with git, you can check out the code. Otherwise you can download a tarball or zip file of the project by clicking on Downloads.

The scrapy tag on Snipplr is used for sharing code snippets such as spiders, middlewares, extensions, or scripts. Feel free (and encouraged!) to share any code there.

Scrapy at a glance: Understand what Scrapy is and how it can help you.
Installation guide: Get Scrapy installed on your computer.
Scrapy Tutorial: Write your first Scrapy project.
Examples: Learn more by playing with a pre-made Scrapy project.
