Getting Started with Scrapy, a Python Web Crawling Framework

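"Scrapy is a powerful and flexible web crawling framework for the Python programming language, used to crawl websites efficiently and extract structured information. The framework includes many features, such as a command-line tool, Items, Spiders, Link Extractors, XPath Selectors, and the Item Pipeline. Scrapy also provides built-in services such as logging, stats collection, sending email, a Telnet console, and a web service interface. In addition, the Scrapy documentation covers how to solve specific problems, such as debugging memory leaks, downloading images, Ubuntu packages, and deploying and managing crawler projects through the Scrapy Service (scrapyd). Scrapy can also be extended with downloader middlewares, spider middlewares, and custom extensions, which lets developers tailor crawler behavior to their needs."

Scrapy is an open-source web crawling framework for Python, designed for rapid development and for processing large volumes of web data. It provides many out-of-the-box features that simplify writing crawlers. The key points are described below:

1. **Scrapy overview**: Scrapy consists of several components, such as the engine, scheduler, downloader, spiders, and item pipeline, which work together to crawl pages and extract data.
2. **Installation guide**: Scrapy can usually be installed into a Python environment with the pip command; make sure all dependencies, such as Twisted and w3lib, are installed.
3. **Scrapy tutorial**: The tutorial shows how to create a basic Scrapy project, write a spider, define items, set up link extractors, and configure the item pipeline.
4. **Items**: Items define the structure of the data to be scraped. They resemble Python dictionaries and can contain a variety of fields, which helps with data modeling and storage.
5. **Spiders**: Spiders are the core of Scrapy. They parse page content and define how to follow links and extract data. Developers can write multiple spiders to suit different site structures.
6. **Link Extractors**: Link extractors pull links out of HTML or XML documents and support rules, such as regular expressions, to control how links are filtered and selected.
7. **XPath Selectors**: XPath is a language for selecting nodes in XML and HTML documents. Scrapy uses XPath to select page elements for data extraction.
8. **Item Loaders**: Item Loaders are tools for populating Items. They let developers merge extracted fragments into an Item and apply cleaning and conversion steps.
9. **Item Pipeline**: The Item Pipeline is the stage where Scrapy processes scraped data. It can clean, validate, and store the data to ensure data quality.
10. **Built-in services**: Scrapy provides logging for debugging and monitoring; stats collection gathers statistics while a crawler runs; the Telnet console and the web service offer interactive interfaces; and email notifications can be sent under specific conditions.
11. **Solving specific problems**: The documentation also covers using Firefox for scraping, using Firebug as a debugging aid, and detecting and handling memory leaks.
12. **Scrapy Service (scrapyd)**: scrapyd is a server for deploying and managing Scrapy crawlers. It allows crawlers to be started, stopped, and monitored remotely.
13. **Extending Scrapy**: Scrapy's architecture lets developers write their own downloader middlewares, spider middlewares, and extensions to meet specific needs, such as customizing download behavior, processing HTTP responses, or intercepting requests.

With these components and features, Scrapy offers a powerful and flexible platform for building complex web crawlers efficiently and for mining web pages into structured data.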

Scrapy Documentation, Release 0.14.4

• tutorial/pipelines.py: the project’s pipelines file.

• tutorial/settings.py: the project’s settings file.

• tutorial/spiders/: a directory where you’ll later put your spiders.

2.3.2 Defining our Item

Items are containers that will be loaded with the scraped data; they work like simple Python dicts but provide additional protection against populating undeclared fields, to prevent typos.

They are declared by creating a scrapy.item.Item subclass and defining its attributes as scrapy.item.Field objects, as you would in an ORM (don’t worry if you’re not familiar with ORMs; you will see that this is an easy task).

We begin by modeling the item that will hold the site data obtained from dmoz.org. Because we want to capture the name, URL and description of each site, we define a field for each of these three attributes. To do that, we edit items.py, found in the tutorial directory. Our Item class looks like this:

    from scrapy.item import Item, Field

    class DmozItem(Item):
        title = Field()  # name of the site
        link = Field()   # URL of the site
        desc = Field()   # description of the site

This may seem complicated at first, but defining the item allows you to use other handy components of Scrapy that need to know what your item looks like.
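
The protection against undeclared fields described above can be illustrated without Scrapy at all. The following is a minimal standard-library sketch of the idea, not Scrapy's actual implementation: the StrictItem class and its fields tuple are hypothetical names made up for this example, and only the field names are taken from DmozItem.

```python
class StrictItem(dict):
    """Dict-like container that only accepts fields declared up front.

    Hypothetical illustration; this is not how scrapy.item.Item is written.
    """

    # Field names borrowed from the DmozItem example above.
    fields = ("title", "link", "desc")

    def __setitem__(self, key, value):
        # Reject any key that was not declared, so typos fail loudly.
        if key not in self.fields:
            raise KeyError("undeclared field: %s" % key)
        dict.__setitem__(self, key, value)


item = StrictItem()
item["title"] = "Example book"      # declared field, accepted

try:
    item["titel"] = "Example book"  # typo in the field name, rejected
except KeyError as exc:
    rejected = str(exc)
```

In this sketch a misspelled key raises an error at assignment time instead of silently creating a new entry, which is the kind of typo protection the text refers to. Note that the sketch only guards item assignment; dict methods such as update() bypass the check.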

2.3.3 Our first Spider

Spiders are user-written classes used to scrape information from a domain (or group of domains).

They define an initial list of URLs to download, how to follow links, and how to parse the contents of those pages to extract items.

To create a Spider, you must subclass scrapy.spider.BaseSpider and define the three main, mandatory attributes:

• name: identifies the Spider. It must be unique, that is, you can’t set the same name for different Spiders.

• start_urls: a list of URLs where the Spider will begin to crawl. The first pages downloaded will be those listed here; subsequent URLs are generated successively from data contained in the start URLs.

• parse(): a method of the spider, which will be called with the downloaded Response object of each start URL. The response is passed to the method as the first and only argument.

The parse() method is in charge of processing the response and returning scraped data (as Item objects) and more URLs to follow (as Request objects).

This is the code for our first Spider; save it in a file named dmoz_spider.py under the tutorial/spiders directory:

    from scrapy.spider import BaseSpider

    class DmozSpider(BaseSpider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]

        def parse(self, response):
            # Name the file after the last path segment of the URL
            filename = response.url.split("/")[-2]
            # Save the raw response body; the with block closes the file
            with open(filename, 'wb') as f:
                f.write(response.body)

Crawling

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl dmoz

The crawl dmoz command runs the spider for the dmoz.org domain. You will get output similar to this:

    2008-08-20 03:51:13-0300 [scrapy] INFO: Started project: dmoz
    2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled extensions: ...
    2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled downloader middlewares: ...
    2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled spider middlewares: ...
    2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled item pipelines: ...
    2008-08-20 03:51:14-0300 [dmoz] INFO: Spider opened
    2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: <None>)
    2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: <None>)
    2008-08-20 03:51:14-0300 [dmoz] INFO: Spider closed (finished)

Pay attention to the lines containing [dmoz], which correspond to our spider. You can see a log line for each URL defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is shown at the end of the log line, where it says (referer: <None>).

More interestingly, as our parse method instructs, two files have been created: Books and Resources, with the content of both URLs.
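
The file names come from the expression response.url.split("/")[-2] in the parse method. The short sketch below applies the same expression to plain strings, so it runs without Scrapy; the helper name filename_for is an assumption introduced only for this illustration.

```python
def filename_for(url):
    """Return the second-to-last piece of the URL when split on "/"."""
    # A URL ending in "/" yields an empty final piece, so index -2 is
    # the last real path segment.
    return url.split("/")[-2]


start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

filenames = [filename_for(url) for url in start_urls]
print(filenames)
```

The expression relies on the trailing slash: for a URL that does not end in "/", index -2 would pick the parent segment instead of the last one.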

What just happened under the hood?

Scrapy creates a scrapy.http.Request object for each URL in the start_urls attribute of the Spider, and assigns them the parse method of the spider as their callback function.

These Requests are scheduled, then executed, and scrapy.http.Response objects are returned and then fed back to the spider, through the parse() method.
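
The request and callback flow described above can be modelled in a few lines of plain Python. This is a deliberately simplified, sequential toy and not Scrapy's engine (Scrapy runs asynchronously on top of Twisted); FakeResponse, fake_download and crawl are hypothetical names, and no network access takes place.

```python
from collections import deque


class FakeResponse(object):
    """Stand-in for a downloaded response; carries only url and body."""

    def __init__(self, url, body):
        self.url = url
        self.body = body


def fake_download(url):
    # No network access: fabricate a body so the sketch stays runnable.
    return FakeResponse(url, "<html>%s</html>" % url)


def crawl(start_urls, callback):
    """Queue one (url, callback) request per start URL, then run them."""
    queue = deque((url, callback) for url in start_urls)
    results = []
    while queue:
        url, cb = queue.popleft()      # take the next scheduled request
        response = fake_download(url)  # execute it
        results.append(cb(response))   # feed the response to the callback
    return results


def parse(response):
    # Same expression the tutorial spider uses to name its output file.
    return response.url.split("/")[-2]


visited = crawl(
    [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ],
    parse,
)
```

Each start URL is paired with the callback when it is queued, which mirrors how the parse method is assigned to the initial requests.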

Extracting Items

Introduction to Selectors

There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath expressions called XPath selectors. For more information about selectors and other extraction mechanisms see the XPath selectors documentation.

Here are some examples of XPath expressions and their meanings:

• /html/head/title: selects the <title> element, inside the <head> element of an HTML document

• /html/head/title/text(): selects the text inside the aforementioned <title> element.

• //td: selects all the <td> elements

• //div[@class="mine"]: selects all <div> elements that have the attribute class="mine"

These are just a couple of simple examples of what you can do with XPath, but XPath expressions are indeed much more powerful. To learn more about XPath we recommend this XPath tutorial.
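
To experiment with expressions like these without Scrapy, Python's standard xml.etree.ElementTree module can be used as a rough stand-in. It is not the selector class this tutorial uses, it supports only a subset of XPath (no text() step and no absolute paths), and it needs well-formed markup. The sample page below is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree needs valid XML).
PAGE = (
    "<html><head><title>Sample page</title></head>"
    "<body>"
    "<div class='mine'><table><tr><td>a</td><td>b</td></tr></table></div>"
    "<div class='other'>ignored</div>"
    "</body></html>"
)

root = ET.fromstring(PAGE)  # root is the <html> element

# /html/head/title, written relative to the <html> root
title_element = root.find("head/title")

# /html/head/title/text(): ElementTree exposes the text as an attribute
title_text = title_element.text

# //td: ".//td" searches all descendants of the root
cells = [td.text for td in root.findall(".//td")]

# //div[@class="mine"]: attribute predicates are supported
mine = root.findall(".//div[@class='mine']")
```

Because the search starts at the <html> root, the leading /html and // of the original expressions become relative paths here.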

For working with XPaths, Scrapy provides an XPathSelector class, which comes in two flavours, HtmlXPathSelector (for HTML data) and XmlXPathSelector (for XML data). In order to use them you must instantiate the desired class with a Response object.

You can see selectors as objects that represent nodes in the document structure. So, the first instantiated selectors are associated to the root node, or the entire document.

Selectors have three methods:

• select(): returns a list of selectors, each of them representing the nodes selected by the XPath expression given as argument.

• extract(): returns a unicode string with the data selected by the XPath selector.

• re(): returns a list of unicode strings extracted by applying the regular expression given as argument.

Trying Selectors in the Shell

To illustrate the use of Selectors we’re going to use the built-in Scrapy shell, which also requires IPython (an extended Python console) to be installed on your system.

To start a shell, you must go to the project’s top level directory and run:

scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

This is what the shell looks like:

    [ ... Scrapy log here ... ]

    [s] Available Scrapy objects:
    [s] 2010-08-19 21:45:59-0300 [default] INFO: Spider closed (finished)
    [s]   hxs        <HtmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
    [s]   item       Item()
    [s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    [s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    [s]   spider     <BaseSpider 'default' at 0x1b6c2d0>
    [s]   xxs        <XmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
    [s] Useful shortcuts:
    [s]   shelp()            Print this help
    [s]   fetch(req_or_url)  Fetch a new request or URL and update shell objects
    [s]   view(response)     View response in a browser

    In [1]:

After the shell loads, you will have the response fetched in a local response variable, so if you type response.body you will see the body of the response, or you can type response.headers to see its headers.

The shell also instantiates two selectors, one for HTML (in the hxs variable) and one for XML (in the xxs variable) with this response. So let’s try them:

    In [1]: hxs.select('//title')
    Out[1]: [<HtmlXPathSelector (title) xpath=//title>]

    In [2]: hxs.select('//title').extract()
    Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

    In [3]: hxs.select('//title/text()')
    Out[3]: [<HtmlXPathSelector (text) xpath=//title/text()>]

    In [4]: hxs.select('//title/text()').extract()
    Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

    In [5]: hxs.select('//title/text()').re('(\w+):')
    Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
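
The result of the re() call in the last step can be reproduced with the standard re module applied to the extracted title text. This sketch assumes that re() collects the captured group of every match, which is consistent with the output shown in the session; it does not use Scrapy itself.

```python
import re

# Title text as returned by the extract() call in the session above.
title = "Open Directory - Computers: Programming: Languages: Python: Books"

# (\w+): captures each run of word characters that is directly
# followed by a colon.
words_before_colon = re.findall(r"(\w+):", title)
print(words_before_colon)
```

"Directory" and "Books" are not captured because neither is immediately followed by a colon.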

Extracting the data

Now, let’s try to extract some real information from those pages.

You could type response.body in the console, and inspect the source code to figure out the XPaths you need to use. However, inspecting the raw HTML code there could become a very tedious task. To make this an easier task, you can use some Firefox extensions like Firebug. For more information see Using Firebug for scraping and Using Firefox for scraping.

After inspecting the page source, you’ll find that the websites’ information is inside a <ul> element, in fact the second <ul> element.

So we can select each <li> element belonging to the sites list with this code:

    hxs.select('//ul/li')

And from them, the sites’ descriptions:

    hxs.select('//ul/li/text()').extract()

The sites’ titles:

    hxs.select('//ul/li/a/text()').extract()

And the sites’ links:

    hxs.select('//ul/li/a/@href').extract()

As we said before, each select() call returns a list of selectors, so we can concatenate further select() calls to dig deeper into a node. We are going to use that property here, so:

    sites = hxs.select('//ul/li')
    for site in sites:
        # These XPaths are relative to the current <li> node
        title = site.select('a/text()').extract()
        link = site.select('a/@href').extract()
        desc = site.select('text()').extract()
        print title, link, desc  # Python 2 print statement

Note: For a more detailed description of using nested selectors, see Nesting selectors and Working with relative XPaths in the XPath Selectors documentation.
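
The nested selection pattern (pick each <li>, then query relative to it) can also be tried without Scrapy. The sketch below uses the standard xml.etree.ElementTree module as a stand-in for the selector classes; the markup, the example.com URLs and the extract_sites helper are all hypothetical, and the real directory page is not guaranteed to have this shape.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for the directory page.
PAGE = """
<html><body>
  <ul>
    <li><a href="http://example.com/a">Book A</a> - First description</li>
    <li><a href="http://example.com/b">Book B</a> - Second description</li>
  </ul>
</body></html>
"""


def extract_sites(markup):
    """Return one dict per <li>, queried relative to that <li> node."""
    root = ET.fromstring(markup.strip())
    sites = []
    for li in root.findall(".//ul/li"):    # outer selection: every <li>
        anchor = li.find("a")              # inner selection, relative to li
        sites.append({
            "title": anchor.text,
            "link": anchor.get("href"),
            # ElementTree stores the text after </a> in the tail attribute.
            "desc": (anchor.tail or "").strip(" -\n"),
        })
    return sites


sites = extract_sites(PAGE)
```

As in the loop above, the inner queries are relative to each list item, so every title, link and description stays grouped with the site it belongs to.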

Let’s add this code to our spider:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
