Python Scrapy Tutorial: From Getting Started to Hands-On Practice


Updated 2023-06-02; PDF, 1 MB.


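This resource is a detailed Python Scrapy tutorial for learners at every level. Scrapy is a powerful web crawling framework for fetching and processing website data efficiently. The tutorial runs from the basics to advanced topics and covers the following key areas:

1. **Getting help**: For beginners, Chapter 1 gives a quick overview of Scrapy's basic concepts and uses, which helps readers understand its role in web scraping.
2. **Installation and first steps**:
   - **Installation guide**: how to install Scrapy and its dependencies so that the environment is configured correctly.
   - **Scrapy tutorial**: a step-by-step guide to creating a first crawler, including setting up the project structure and writing spiders and selectors.
3. **Basic concepts**:
   - **Command line tool**: the Scrapy command-line interface used to manage and run crawlers.
   - **Spiders**: how to design and implement the crawling logic, including defining start_urls, parsing responses, and processing data.
   - **Selectors**: XPath and CSS selectors, the two common ways to extract data from web pages.
   - **Items**: the data model that defines the structure of the scraped data.
   - **Item Loaders**: how to use ItemLoader to clean and transform data.
   - **Scrapy shell**: an interactive environment for testing and debugging extraction code.
   - **Item Pipeline**: cleaning, validating, and persisting scraped data.
   - **Feed exports**: storing scraped data in formats such as CSV and JSON.
   - **Requests and responses**: how HTTP requests and responses are represented and handled.
   - **Link extractors**: discovering links in web pages.
   - **Settings**: the configuration options that tune crawler performance and behavior.
   - **Exceptions**: the built-in exceptions and how to handle errors.
4. **Built-in services**:
   - **Logging**: recording and analyzing information while a crawler runs.
   - **Stats collection**: tracking and reporting crawler performance metrics.
   - **Sending e-mail**: sending e-mail notifications, such as crawl results or reports.
   - **Telnet console**: inspecting and debugging a running crawler interactively.
   - **Web service**: monitoring and controlling a crawler through a web service.
5. **Solving specific problems**:
   - **Frequently asked questions**: answers to problems beginners commonly run into.
   - **Debugging spiders**: how to locate and fix errors in spider code.
   - **Spider contracts**: a way to test spiders by attaching contracts to their callbacks.
   - **Common practices**: practical tips and strategies for scraping.
   - **Broad crawls**: handling large sites and deep link structures.
   - **Using browser tools**: how to use Firefox and Firebug for scraping.
   - **Debugging memory leaks**: keeping the crawler's memory use under control.
   - **Downloading and processing files and images**: methods for downloading and handling media files.
   - **Ubuntu packages**: installation and usage guidance for Linux users.
   - **Deployment**: how to deploy Scrapy to a production environment.
   - **Extensions**: for example AutoThrottle, which adjusts crawl speed.
   - **Benchmarking**: measuring and optimizing crawler performance.

The tutorial covers Scrapy from basic to advanced use, so both Python beginners and experienced developers can find what they need to build an efficient web crawler.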


Scrapy Documentation, Release 1.0.1

things like: the link that contains the text ‘Next Page’. Because of this, we encourage you to learn about XPath even if

you already know how to construct CSS selectors.

For working with CSS and XPath expressions, Scrapy provides the Selector class and convenient shortcuts, so you do not have to instantiate selectors yourself every time you need to select something from a response.

You can see selectors as objects that represent nodes in the document structure. So, the first instantiated selectors are associated with the root node, or the entire document.

Selectors have four basic methods:

• xpath(): returns a list of selectors, each of which represents the nodes selected by the XPath expression given as argument.

• css(): returns a list of selectors, each of which represents the nodes selected by the CSS expression given as argument.

• extract(): returns a unicode string with the selected data.

• re(): returns a list of unicode strings extracted by applying the regular expression given as argument.

Trying Selectors in the Shell

To illustrate the use of Selectors we’re going to use the built-in Scrapy shell, which also requires IPython (an extended

Python console) installed on your system.

To start a shell, you must go to the project’s top level directory and run:

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

Note: Remember to always enclose URLs in quotes when running the Scrapy shell from the command line; otherwise URLs containing arguments (i.e. the & character) will not work.

This is what the shell looks like:

[ ... Scrapy log here ... ]

2014-01-23 17:11:42-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)

[s] Available Scrapy objects:

[s] crawler <scrapy.crawler.Crawler object at 0x3636b50>

[s] item {}

[s] request <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>

[s] response <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>

[s] settings <scrapy.settings.Settings object at 0x3fadc50>

[s] spider <Spider 'default' at 0x3cebf50>

[s] Useful shortcuts:

[s] shelp() Shell help (print this help)

[s] fetch(req_or_url) Fetch request (or URL) and update local objects

[s] view(response) View response in a browser

In [1]:

After the shell loads, you will have the response fetched in a local response variable, so if you type response.body you will see the body of the response, or you can type response.headers to see its headers.

More importantly, response has a selector attribute, which is an instance of the Selector class instantiated with this particular response. You can run queries on response by calling response.selector.xpath() or response.selector.css(). There are also some convenience shortcuts like response.xpath() or response.css(), which map directly to response.selector.xpath() and response.selector.css().

So let’s try it:

In [1]: response.xpath('//title')

Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

In [2]: response.xpath('//title').extract()

Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: response.xpath('//title/text()')

Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]

In [4]: response.xpath('//title/text()').extract()

Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: response.xpath('//title/text()').re('(\w+):')

Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
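As a rough mental model of the last call (an assumption about its observable behavior, not a description of Scrapy's internals), applying .re() to a single text node gives much the same result as running the standard library's re.findall on the extracted string. The sketch below is written for Python 3 and copies the title text from the session above:

```python
import re

# Title text as shown in the shell session above.
title = "Open Directory - Computers: Programming: Languages: Python: Books"

# Each match of (\w+): contributes its captured group, that is, the word
# immediately before a colon.
words = re.findall(r"(\w+):", title)
print(words)
```

"Books" is not returned because no colon follows it.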

Extracting the data

Now, let’s try to extract some real information from those pages.

You could type response.body in the console, and inspect the source code to figure out the XPaths you need to use. However, inspecting the raw HTML code there could become a very tedious task. To make it easier, you can use Firefox Developer Tools or some Firefox extensions like Firebug. For more information see Using Firebug for scraping and Using Firefox for scraping.

After inspecting the page source, you’ll find that the web site’s information is inside a <ul> element, in fact the second <ul> element.

So we can select each <li> element belonging to the site’s list with this code:

response.xpath('//ul/li')

And from them, the site’s descriptions:

response.xpath('//ul/li/text()').extract()

The site’s titles:

response.xpath('//ul/li/a/text()').extract()

And the site’s links:

response.xpath('//ul/li/a/@href').extract()

As we’ve said before, each .xpath() call returns a list of selectors, so we can concatenate further .xpath() calls

to dig deeper into a node. We are going to use that property here, so:

for sel in response.xpath('//ul/li'):
    # Each expression below is evaluated relative to the current <li>.
    title = sel.xpath('a/text()').extract()
    link = sel.xpath('a/@href').extract()
    desc = sel.xpath('text()').extract()
    print(title, link, desc)
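The idea of evaluating a path relative to each matched node can be tried without a live response. The sketch below is a stand-in only: it uses the standard library's xml.etree.ElementTree instead of Scrapy's Selector, supports just a small XPath subset, and runs on made-up, well-formed markup (the PAGE string and example.com URLs are hypothetical).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for the directory listing.
PAGE = """
<html><body>
<ul>
  <li><a href="http://example.com/a">Book A</a> - First description.</li>
  <li><a href="http://example.com/b">Book B</a> - Second description.</li>
</ul>
</body></html>
"""

root = ET.fromstring(PAGE.strip())
rows = []
for li in root.findall(".//ul/li"):
    a = li.find("a")                # looked up relative to this <li> only
    title = a.text
    link = a.get("href")
    desc = (a.tail or "").strip()   # text that follows the <a> inside the <li>
    rows.append((title, link, desc))
    print(title, link, desc)
```

As in the Scrapy loop, the outer query picks the list items and the inner lookups never leave the current item.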

Note: For a more detailed description of using nested selectors, see Nesting selectors and Working with relative


Note: You can find a fully-functional variant of this spider in the dirbot project available at https://github.com/scrapy/dirbot

Now crawling dmoz.org yields DmozItem objects:

[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n'],
     'link': [u'http://gnosis.cx/TPiP/'],
     'title': [u'Text Processing in Python']}
[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
     'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
     'title': [u'XML Processing with Python']}

2.3.4 Following links

Let’s say, instead of just scraping the stuff in Books and Resources pages, you want everything that is under the Python

directory.

Now that you know how to extract data from a page, why not extract the links for the pages you are interested in, follow them and then extract the data you want for all of them?

Here is a modification to our spider that does just that:

import scrapy

from tutorial.items import DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        # Follow every category link found on the start page.
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # Build one item per list entry on the category page.
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

Now the parse() method only extracts the interesting links from the page, builds a full absolute URL using the response.urljoin method (since the links can be relative) and yields new requests to be sent later, registering as callback the method parse_dir_contents() that will ultimately scrape the data we want.
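To see why relative links need joining, it can help to look at URL resolution on its own. The sketch below assumes that response.urljoin resolves a link against the response's URL in the same way as the standard library's urljoin; it is written for Python 3, and the three link values are made up for illustration.

```python
from urllib.parse import urljoin

# URL of the page being parsed (the start URL of the spider above).
base = "http://www.dmoz.org/Computers/Programming/Languages/Python/"

relative = urljoin(base, "Books/")                 # relative to the current directory
root_relative = urljoin(base, "/Computers/")       # relative to the site root
absolute = urljoin(base, "http://example.com/x")   # already absolute, left unchanged

print(relative)
print(root_relative)
print(absolute)
```

Whatever form the href takes, the result is an absolute URL that can be passed to a new request.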

What you see here is Scrapy’s mechanism of following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.

Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page being visited.


A common pattern is a callback method that extracts some items, looks for a link to follow to the next page and then yields a Request with the same callback for it:

def parse_articles_follow_next_page(self, response):
    for article in response.xpath("//article"):
        item = ArticleItem()

        # ... extract article data here

        yield item

    # Follow the "next page" link, if there is one, with this same callback.
    next_page = response.css("ul.navigation > li.next-page > a::attr('href')")
    if next_page:
        url = response.urljoin(next_page[0].extract())
        yield scrapy.Request(url, self.parse_articles_follow_next_page)

This creates a sort of loop, following all the links to the next page until it doesn’t find one, which is handy for crawling blogs, forums and other sites with pagination.
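The request-and-callback loop can be mimicked in a few lines of plain Python. This is a toy model only, not Scrapy's engine: the FAKE_SITE dictionary, the (url, callback) tuples standing in for Request objects, and the crawl() driver are all hypothetical, and nothing is downloaded.

```python
from collections import deque

# Hypothetical three-page site: each page has some articles and maybe a next link.
FAKE_SITE = {
    "/page/1": {"articles": ["a1", "a2"], "next": "/page/2"},
    "/page/2": {"articles": ["a3"], "next": "/page/3"},
    "/page/3": {"articles": ["a4"], "next": None},
}


def parse_page(url, page):
    """Callback: yield items, then at most one follow-up request."""
    for title in page["articles"]:
        yield {"title": title}              # an "item"
    if page["next"]:
        yield (page["next"], parse_page)    # a "request": (url, callback)


def crawl(start_url, callback):
    """Toy scheduler: run callbacks and queue whatever requests they yield."""
    queue = deque([(start_url, callback)])
    items = []
    while queue:
        url, cb = queue.popleft()
        for result in cb(url, FAKE_SITE[url]):
            if isinstance(result, tuple):
                queue.append(result)        # schedule the follow-up request
            else:
                items.append(result)        # collect the scraped item
    return items


items = crawl("/page/1", parse_page)
print([item["title"] for item in items])
```

The crawl stops on its own once a page yields no further request, which is the behavior the pagination pattern relies on.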

Another common pattern is to build an item with data from more than one page, using a trick to pass additional data

to the callbacks.

Note: As an example spider that leverages this mechanism, check out the CrawlSpider class for a generic spider

that implements a small rules engine that you can use to write your crawlers on top of it.

2.3.5 Storing the scraped data

The simplest way to store the scraped data is by using Feed exports, with the following command:

scrapy crawl dmoz -o items.json

That will generate an items.json file containing all scraped items, serialized in JSON.
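Once the export exists, it can be read back with the standard json module. The sketch below assumes the file holds a single JSON array of item objects; the load_items helper is hypothetical, and the sample records are made up and written to a temporary file so the example runs without a crawl.

```python
import json
import os
import tempfile


def load_items(path):
    """Read a JSON feed export (assumed to be one JSON array of items)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Stand-in for a real items.json; the two records are invented for the demo.
sample = [
    {"title": ["Book A"], "link": ["http://example.com/a"], "desc": [" - First.\n"]},
    {"title": ["Book B"], "link": ["http://example.com/b"], "desc": [" - Second.\n"]},
]
path = os.path.join(tempfile.mkdtemp(), "items.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f)

for item in load_items(path):
    # Each field is a list because extract() returns a list of strings.
    print(item["title"][0], item["link"][0])
```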

In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. As with Items, a placeholder file for Item Pipelines has been set up for you when the project is created, in tutorial/pipelines.py. You don’t need to implement any item pipelines, though, if you just want to store the scraped items.
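For a sense of what such a pipeline might look like, here is a minimal sketch. The class name StripWhitespacePipeline and its cleaning rule are hypothetical; the only part taken from Scrapy's interface is the process_item(self, item, spider) method, which receives each scraped item and returns it. A plain dict stands in for a DmozItem so the example runs without Scrapy installed.

```python
class StripWhitespacePipeline(object):
    """Hypothetical pipeline: trim whitespace in every list-valued field."""

    def process_item(self, item, spider):
        for field, values in list(item.items()):
            if isinstance(values, list):
                # Drop whitespace-only strings and trim the rest.
                item[field] = [v.strip() for v in values if v.strip()]
        return item


# Exercise the pipeline on a plain dict standing in for a DmozItem.
raw = {
    "title": ["Text Processing in Python"],
    "link": ["http://gnosis.cx/TPiP/"],
    "desc": ["\n", " - By David Mertz; Addison Wesley.\n"],
}
cleaned = StripWhitespacePipeline().process_item(raw, spider=None)
print(cleaned)
```

In a real project the class would live in tutorial/pipelines.py and would need to be enabled through the ITEM_PIPELINES setting before Scrapy calls it.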

2.3.6 Next steps

This tutorial covered only the basics of Scrapy, but there are many other features not mentioned here. Check the What else? section in the Scrapy at a glance chapter for a quick overview of the most important ones.

Then, we recommend you play with an example project (see Examples), and then continue with the section Basic concepts.

2.4 Examples

The best way to learn is with examples, and Scrapy is no exception. For this reason, there is an example Scrapy project named dirbot that you can use to play with and learn more about Scrapy. It contains the dmoz spider described in the tutorial.

This dirbot project is available at: https://github.com/scrapy/dirbot

It contains a README file with a detailed description of the project contents.
