An In-Depth Exploration of the Python Scrapy Crawling Framework

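"This is a detailed guide to the Python Scrapy crawling framework, aimed mainly at learners with a solid command of English. It explains in depth how Scrapy works and provides plenty of practical code and examples to help readers master this powerful web crawling tool."

Scrapy is a high-level crawling framework written in Python that simplifies fetching web pages and extracting data from them. The book covers Scrapy's basic concepts and usage in detail, including the following:

1. **Getting help**: The book likely covers how to find support from the Scrapy community and documentation when you run into problems, and how to get help through official channels.
2. **Scrapy at a glance**: A high-level introduction to Scrapy's features and architecture that helps readers quickly understand how it works.
3. **Installation guide**: Detailed steps for installing Scrapy, including system requirements, the installation process, and problems you may encounter.
4. **Scrapy tutorial**: A step-by-step example Scrapy project that lets readers learn by doing.
5. **Examples**: Several practical crawler code examples showing Scrapy applied in different scenarios.
6. **Basic concepts**:
   - **Command line tool**: How to use the Scrapy command-line interface to create projects, run spiders, and perform other operations.
   - **Items**: The data structures in Scrapy that define the schema of the data to be scraped.
   - **Spiders**: The core of a crawler, responsible for parsing pages and either generating requests or extracting data.
   - **Link Extractors**: Components that extract links from pages and help define the scope of a crawl.
   - **Selectors**: Tools based on XPath or CSS selectors for extracting data from HTML or XML documents.
   - **Item Loaders**: A convenient mechanism for populating Items that handles data cleaning and conversion.
   - **Scrapy Shell**: An interactive command-line tool for testing and debugging selectors and link extractors.
   - **Item Pipeline**: The pipeline that processes Items and can clean, validate, and store data.
   - **Feed Exports**: A feature that exports scraped data in formats such as CSV and JSON.
   - **Link Extractors** (duplicate entry): Listed a second time, possibly covering link handling in more detail.
7. **Built-in services**: Utilities that ship with Scrapy, such as logging, stats collection, sending email, the telnet console, and the web service.
8. **Solving specific problems**: Frequently asked questions, debugging tips, Spider Contracts (for automatically testing spider behavior), best practices, large-scale crawls, debugging with Firefox and Firebug, memory leak detection, image downloading, Ubuntu packages, Scrapyd (spider deployment), AutoThrottle (dynamic rate adjustment), benchmarking, the Jobs feature for pausing and resuming crawls, and DjangoItem (integration with Django models).
9. **Extending Scrapy**: This part likely covers how to customize and extend Scrapy to fit your needs, including writing new middlewares, spiders, and pipelines.

The book offers a comprehensive learning path for Scrapy that can benefit both beginners and experienced developers. By reading and practicing, you can learn the Scrapy framework and use it to scrape and process web data efficiently.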

Scrapy Documentation, Release 0.17.0

        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Name the output file after the last path segment of the URL
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

Crawling

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl dmoz

The crawl dmoz command runs the spider for the dmoz.org domain. You will get output similar to this:

2008-08-20 03:51:13-0300 [scrapy] INFO: Started project: dmoz

2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled extensions: ...

2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled downloader middlewares: ...

2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled spider middlewares: ...

2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled item pipelines: ...

2008-08-20 03:51:14-0300 [dmoz] INFO: Spider opened

2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: <None>)

2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: <None>)

2008-08-20 03:51:14-0300 [dmoz] INFO: Spider closed (finished)

Pay attention to the lines containing [dmoz], which correspond to our spider. You can see a log line for each URL defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is shown at the end of the log line, where it says (referer: <None>).

More interestingly, as our parse method instructs, two files have been created: Books and Resources, each holding the content of the corresponding URL.
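To see where the names Books and Resources come from, here is a small sketch that applies the filename logic from parse() to the two start URLs as plain strings. No Scrapy objects are involved, and filename_for is a hypothetical helper name introduced only for this illustration.

```python
# Illustration only: reproduces the filename logic from parse() on plain strings.
START_URLS = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]


def filename_for(url):
    """Return the second-to-last piece of the URL when split on '/'.

    Both URLs end with a slash, so the last piece is an empty string
    and index -2 is the final path segment.
    """
    return url.split("/")[-2]


filenames = [filename_for(url) for url in START_URLS]
print(filenames)  # ['Books', 'Resources']
```

Note that a URL without a trailing slash would yield the segment before the last one, so this naming scheme depends on the shape of the start URLs.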

What just happened under the hood?

Scrapy creates a scrapy.http.Request object for each URL in the start_urls attribute of the Spider, and assigns the spider’s parse method to each of them as its callback function.

These Requests are scheduled, then executed, and the resulting scrapy.http.Response objects are fed back to the spider through the parse() method.
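The flow described above can be pictured with a toy model in plain Python. This is only a sketch of the idea, not Scrapy's implementation: ToyRequest, ToyResponse, ToySpider, fake_download, and run are all hypothetical names, the example.com URLs are placeholders, and no network access takes place.

```python
from collections import deque


class ToyRequest:
    """Stand-in for a request: a URL plus the callback that handles its response."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class ToyResponse:
    """Stand-in for a response: the URL that was fetched and a body."""

    def __init__(self, url, body):
        self.url = url
        self.body = body


class ToySpider:
    start_urls = ["http://example.com/Books/", "http://example.com/Resources/"]

    def __init__(self):
        self.seen = []

    def parse(self, response):
        # Default callback: record which page arrived.
        self.seen.append(response.url.split("/")[-2])


def fake_download(request):
    """Pretend to fetch the page and return a canned body."""
    return ToyResponse(request.url, b"<html></html>")


def run(spider):
    # 1. One request per start URL, each with spider.parse as its callback.
    queue = deque(ToyRequest(url, spider.parse) for url in spider.start_urls)
    # 2. Requests are taken from the queue and "executed".
    while queue:
        request = queue.popleft()
        response = fake_download(request)
        # 3. The response is fed back to the spider through the callback.
        request.callback(response)


spider = ToySpider()
run(spider)
print(spider.seen)  # ['Books', 'Resources']
```

The point of the model is the pairing of each request with a callback: the code that issues a request does not wait for the result, it only names the method that should receive the response later.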

Extracting Items

Introduction to Selectors

There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath expressions called XPath selectors. For more information about selectors and other extraction mechanisms, see the XPath selectors documentation.

Here are some examples of XPath expressions and their meanings:

• /html/head/title: selects the <title> element inside the <head> element of an HTML document

• /html/head/title/text(): selects the text inside the aforementioned <title> element

• //td: selects all the <td> elements

• //div[@class="mine"]: selects all div elements which contain an attribute class="mine"

These are just a few simple examples of what you can do with XPath; XPath expressions are much more powerful than this. To learn more about XPath we recommend this XPath tutorial.
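The expressions listed above can be tried without Scrapy using Python's standard library. The sketch below uses xml.etree.ElementTree, which supports only a subset of XPath: absolute paths cannot be used on an element and there is no text() step, so each expression is adapted as noted in the comments. The sample page is invented for this illustration and must be well-formed XML for this parser.

```python
import xml.etree.ElementTree as ET

# A small, well-formed sample page invented for this illustration.
PAGE = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div class="mine">first</div>
    <div class="other">second</div>
    <table><tr><td>a</td><td>b</td></tr></table>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# /html/head/title: written relative to the root element here.
title = root.find("head/title")

# /html/head/title/text(): ElementTree has no text() step, so read .text.
title_text = title.text

# //td: every <td> element anywhere below the root.
cells = [td.text for td in root.findall(".//td")]

# //div[@class="mine"]: every <div> whose class attribute equals "mine".
mine = [div.text for div in root.findall(".//div[@class='mine']")]

print(title_text, cells, mine)
```

Scrapy's own selectors accept the full expressions as written in the list, so the adaptations here are a limitation of the standard-library stand-in, not of XPath.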

For working with XPaths, Scrapy provides an XPathSelector class, which comes in two flavours, HtmlXPathSelector (for HTML data) and XmlXPathSelector (for XML data). To use them, you must instantiate the desired class with a Response object.

You can think of selectors as objects that represent nodes in the document structure. So the first selectors you instantiate are associated with the root node, that is, the entire document.

Selectors have three methods:

• select(): returns a list of selectors, each of them representing the nodes selected by the XPath expression given as argument.

• extract(): returns a unicode string with the data selected by the XPath selector.

• re(): returns a list of unicode strings extracted by applying the regular expression given as argument.

Trying Selectors in the Shell

To illustrate the use of Selectors we’re going to use the built-in Scrapy shell, which also requires IPython (an extended Python console) to be installed on your system.

To start a shell, you must go to the project’s top level directory and run:

scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

This is what the shell looks like:

[ ... Scrapy log here ... ]

[s] Available Scrapy objects:

[s] 2010-08-19 21:45:59-0300 [default] INFO: Spider closed (finished)

[s] hxs <HtmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>

[s] item Item()

[s] request <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>

[s] response <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>

[s] spider <BaseSpider 'default' at 0x1b6c2d0>

[s] xxs <XmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>

[s] Useful shortcuts:

[s] shelp() Print this help

[s] fetch(req_or_url) Fetch a new request or URL and update shell objects

[s] view(response) View response in a browser

In [1]:

After the shell loads, you will have the response fetched in a local response variable, so if you type response.body you will see the body of the response, or you can type response.headers to see its headers.

The shell also instantiates two selectors, one for HTML (in the hxs variable) and one for XML (in the xxs variable) with this response. So let’s try them:

In [1]: hxs.select('//title')
Out[1]: [<HtmlXPathSelector (title) xpath=//title>]

In [2]: hxs.select('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: hxs.select('//title/text()')
Out[3]: [<HtmlXPathSelector (text) xpath=//title/text()>]

In [4]: hxs.select('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: hxs.select('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
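The result of the last call can be reproduced with Python's re module on the title string alone. This sketch assumes that the selector's re() method behaves like re.findall on the extracted text for this pattern, which matches the output shown in the session; it does not use Scrapy.

```python
import re

# The text extracted by //title/text() in the session.
title = "Open Directory - Computers: Programming: Languages: Python: Books"

# (\w+): captures each run of word characters directly followed by a colon.
words = re.findall(r"(\w+):", title)
print(words)  # ['Computers', 'Programming', 'Languages', 'Python']
```

"Books" is missing from the result because no colon follows it, and "Directory" is skipped because it is followed by a space rather than a colon.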

Extracting the data

Now, let’s try to extract some real information from those pages.

You could type response.body in the console and inspect the source code to figure out the XPaths you need to use. However, inspecting raw HTML quickly becomes tedious. To make it easier, you can use Firefox extensions such as Firebug. For more information see Using Firebug for scraping and Using Firefox for scraping.

After inspecting the page source, you’ll find that the websites’ information is inside a <ul> element, in fact the second <ul> element.

So we can select each <li> element belonging to the sites list with this code:

hxs.select('//ul/li')

And from them, the site descriptions:

hxs.select('//ul/li/text()').extract()

The site titles:

hxs.select('//ul/li/a/text()').extract()

And the site links:

hxs.select('//ul/li/a/@href').extract()

As we said before, each select() call returns a list of selectors, so we can chain further select() calls to dig deeper into a node. We are going to use that property here:

sites = hxs.select('//ul/li')
for site in sites:
    # These XPaths are relative to the current <li> node
    title = site.select('a/text()').extract()
    link = site.select('a/@href').extract()
    desc = site.select('text()').extract()
    print title, link, desc

Note: For a more detailed description of using nested selectors, see Nesting selectors and Working with relative XPaths in the Selectors documentation.
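The nesting idea, an outer query that finds each list entry followed by inner queries relative to that entry, can be sketched with the standard library as well. The code below uses xml.etree.ElementTree rather than Scrapy's selectors, and the sample markup and example.com links are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample markup shaped like a directory listing: link, then description.
PAGE = """
<html><body>
  <ul>
    <li><a href="http://example.com/one">First book</a> - By A. Author.</li>
    <li><a href="http://example.com/two">Second book</a> - By B. Writer.</li>
  </ul>
</body></html>
"""

root = ET.fromstring(PAGE)

sites = []
for li in root.findall(".//ul/li"):   # outer query: one element per list entry
    a = li.find("a")                   # inner query, relative to this <li>
    sites.append({
        "title": a.text,
        "link": a.get("href"),
        # Text that follows </a> inside the <li> is stored as the tail of <a>.
        "desc": (a.tail or "").strip(),
    })

for site in sites:
    print(site["title"], site["link"], site["desc"])
```

Because the inner lookups start from each li element, the title, link, and description stay grouped per entry instead of arriving as three unrelated lists.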

Let’s add this code to our spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"


Note: You can find a fully functional variant of this spider in the dirbot project, available at https://github.com/scrapy/dirbot

Now doing a crawl on the dmoz.org domain yields DmozItem objects:

[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.\n'],
 'link': [u'http://gnosis.cx/TPiP/'],
 'title': [u'Text Processing in Python']}

[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
 'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
 'title': [u'XML Processing with Python']}

2.3.4 Storing the scraped data

The simplest way to store the scraped data is to use Feed exports, with the following command:

scrapy crawl dmoz -o items.json -t json

That will generate an items.json file containing all scraped items, serialized in JSON.
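Once the export exists, it can be read back with Python's json module. The sketch below assumes the file holds a JSON array of objects whose fields are lists of strings, as in the items logged earlier. So that it runs on its own, it first writes abbreviated sample data to a temporary file standing in for items.json.

```python
import json
import os
import tempfile

# Abbreviated sample shaped like the logged items; not real crawl output.
sample_items = [
    {
        "title": ["Text Processing in Python"],
        "link": ["http://gnosis.cx/TPiP/"],
        "desc": [" - By David Mertz"],
    },
    {
        "title": ["XML Processing with Python"],
        "link": ["http://www.informit.com/store/product.aspx?isbn=0130211192"],
        "desc": [" - By Sean McGrath"],
    },
]

# Write the sample to a temporary file standing in for items.json.
path = os.path.join(tempfile.mkdtemp(), "items.json")
with open(path, "w") as f:
    json.dump(sample_items, f)

# Load the file back and pull the first title out of each item.
with open(path) as f:
    items = json.load(f)

titles = [item["title"][0] for item in items]
print(titles)
```

Each field is indexed with [0] because extract() on a list of selectors returns a list, so even a single title is stored as a one-element list.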

In small projects (like the one in this tutorial), that should be enough. However, if you want to do more complex things with the scraped items, you can write an Item Pipeline. As with Items, a placeholder file for Item Pipelines was set up for you when the project was created, in tutorial/pipelines.py. You don’t need to implement any item pipeline, though, if you just want to store the scraped items.
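As a rough idea of what such a pipeline could look like, here is a minimal sketch. It assumes the pipeline interface is a plain class with a process_item(self, item, spider) method that returns the item; check the Item Pipeline documentation for your release before relying on that. StripDescriptionPipeline is a hypothetical name, and the example exercises the class with a plain dict instead of a real Scrapy item.

```python
class StripDescriptionPipeline:
    """Hypothetical pipeline: trims whitespace from every 'desc' value."""

    def process_item(self, item, spider):
        # Items in this tutorial hold a list of strings for each field.
        item["desc"] = [text.strip() for text in item.get("desc", [])]
        return item  # hand the item on to the next pipeline stage


# Exercise the pipeline without Scrapy: a dict stands in for the item,
# and None stands in for the spider argument.
pipeline = StripDescriptionPipeline()
cleaned = pipeline.process_item({"title": ["T"], "desc": [" - By Someone.\n"]}, None)
print(cleaned["desc"])  # ['- By Someone.']
```

In a real project the class would live in tutorial/pipelines.py and would have to be enabled in the project settings before Scrapy calls it.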

2.3.5 Next steps

This tutorial covers only the basics of Scrapy, but there are many other features not mentioned here. Check the What else? section in the Scrapy at a glance chapter for a quick overview of the most important ones.

Then, we recommend you continue by playing with an example project (see Examples), and then move on to the section Basic concepts.

2.4 Examples

The best way to learn is with examples, and Scrapy is no exception. For this reason, there is an example Scrapy project named dirbot that you can use to experiment and learn more about Scrapy. It contains the dmoz spider described in the tutorial.

This dirbot project is available at: https://github.com/scrapy/dirbot

It contains a README file with a detailed description of the project contents.

If you’re familiar with git, you can check out the code. Otherwise you can download a tarball or zip file of the project by clicking on Downloads.

The scrapy tag on Snipplr is used for sharing code snippets such as spiders, middlewares, extensions, or scripts. Feel free (and encouraged!) to share any code there.

Scrapy at a glance: Understand what Scrapy is and how it can help you.

Installation guide: Get Scrapy installed on your computer.

Scrapy Tutorial: Write your first Scrapy project.
