Scrapy 0.12 Getting-Started Tutorial: Quick Start and Practical Guide


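This Scrapy crawler tutorial is a detailed guide to Scrapy 0.12, suitable for beginners. Scrapy is a powerful Python web crawling framework, particularly suited to scraping and processing web data. The tutorial was published by Insophia on September 18, 2011, and covers everything from the basics to advanced topics.

The tutorial first explains how to get help, including the official documentation and online resources, so that users can find solutions quickly when they run into problems. For readers new to Scrapy, Chapter 2, "First steps", introduces the framework's basic concepts: an overview of Scrapy, the installation steps, and how to get started writing a crawler quickly.

The "Scraping basics" part explains the key components in detail:

1. **Command line tool**: introduces the command line tool, a convenient way to manage and control Scrapy crawler projects.
2. **Items**: focuses on how to define and handle the structure of scraped data; Items are the core element for storing and parsing page content.
3. **Spiders**: covers creating and writing Spiders in depth; the Spider is the core part of Scrapy, responsible for crawling the specified sites and extracting data.
4. **Link Extractors** and **XPath Selectors**: explains how to identify links in web pages and use selectors to locate the target content effectively.
5. **Item Loaders**: shows how to populate Items more efficiently with Item Loaders, with support for multiple data sources.
6. **Scrapy Shell**: an interactive environment for testing XPath expressions and Item Loaders in real time.
7. **Item Pipeline**: shows how to organize the data processing flow, such as cleaning, transforming and storing scraped results.
8. **Feed exports**: covers exporting scraped data to various formats, such as JSON and CSV.

The following chapters introduce the built-in services: Logging, Stats Collection, Sending e-mails, the Telnet Console, and interaction with web services. This material is essential for understanding and optimizing how a crawler runs.

The "Solving specific problems" part offers solutions to common problems, such as using Firefox or Firebug to assist with scraping, dealing with memory leaks, downloading images from web pages, and managing and deploying Scrapy projects on operating systems such as Ubuntu.

The final two parts explore Scrapy's extensibility: understanding its architecture, and writing and configuring downloader middlewares, spider middlewares and extensions. There is also reference documentation for core concepts such as requests and responses, settings, signals, exceptions and Item Exporters.

This tutorial is a strong resource for learning and practicing Scrapy, useful both to beginners and to developers with some experience. By working through the material step by step, readers can build efficient, stable web crawlers.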

Scrapy Documentation, Release 0.12.0

2.3.2 Defining our Item

Items are containers that will be loaded with the scraped data; they work like simple Python dicts but offer some additional features, such as providing default values.

They are declared by creating a scrapy.item.Item subclass and defining its attributes as scrapy.item.Field objects, as you would in an ORM (don’t worry if you’re not familiar with ORMs; you will see that this is an easy task).

We begin by modeling the item that will hold the site data obtained from dmoz.org. Since we want to capture the name, URL and description of each site, we define a field for each of these three attributes. To do that, we edit items.py, found in the dmoz directory. Our Item class looks like this:

# Define here the models for your scraped items

from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

This may seem complicated at first, but defining the item allows you to use other handy components of Scrapy that need to know what your item looks like.
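To make the dict-like behaviour concrete, here is a minimal pure-Python sketch of a container that only accepts fields declared up front. ToyItem, ToyDmozItem and build_item are hypothetical names invented for this illustration, and this is not how scrapy.item.Item is implemented; it only mirrors the idea that a declared set of fields tells other code what an item looks like.

```python
class ToyItem(dict):
    """Illustrative stand-in: a dict that only accepts declared keys."""

    fields = ()  # subclasses list the field names they allow

    def __setitem__(self, key, value):
        # Reject any key that was not declared on the class.
        if key not in self.fields:
            raise KeyError(
                "%s does not support field: %s" % (type(self).__name__, key)
            )
        dict.__setitem__(self, key, value)


class ToyDmozItem(ToyItem):
    # Same three fields as the DmozItem in the tutorial.
    fields = ("title", "link", "desc")


def build_item(title, link, desc):
    """Fill a ToyDmozItem the same way one would fill a plain dict."""
    item = ToyDmozItem()
    item["title"] = title
    item["link"] = link
    item["desc"] = desc
    return item
```

With this sketch, assigning to an undeclared key such as item["price"] raises KeyError, so a misspelled field name fails immediately instead of silently creating a new key.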

2.3.3 Our first Spider

Spiders are user-written classes used to scrape information from a domain (or group of domains).

They define an initial list of URLs to download, how to follow links, and how to parse the contents of those pages to

extract items.

To create a Spider, you must subclass scrapy.spider.BaseSpider and define its three main, mandatory attributes:

• name: identifies the Spider. It must be unique; that is, you can’t set the same name for different Spiders.

• start_urls: a list of URLs where the Spider will begin to crawl from. The first pages downloaded will be those listed here; subsequent URLs will be generated successively from data contained in the start URLs.

• parse(): a method of the spider, which will be called with the downloaded Response object of each start URL. The response is passed to the method as the first and only argument.

The parse() method is in charge of processing the response and returning scraped data (as Item objects) and more URLs to follow (as Request objects).

This is the code for our first Spider; save it in a file named dmoz_spider.py under the dmoz/spiders directory:

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz.org"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Name the output file after the last path segment of the URL
        filename = response.url.split("/")[-2]
        # Save the raw response body to that file
        open(filename, 'wb').write(response.body)

Crawling

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl dmoz.org

The crawl dmoz.org command runs the spider for the dmoz.org domain. You will get an output similar to this:

2008-08-20 03:51:13-0300 [scrapy] INFO: Started project: dmoz
2008-08-20 03:51:13-0300 [dmoz] INFO: Enabled extensions: ...
2008-08-20 03:51:13-0300 [dmoz] INFO: Enabled scheduler middlewares: ...
2008-08-20 03:51:13-0300 [dmoz] INFO: Enabled downloader middlewares: ...
2008-08-20 03:51:13-0300 [dmoz] INFO: Enabled spider middlewares: ...
2008-08-20 03:51:13-0300 [dmoz] INFO: Enabled item pipelines: ...
2008-08-20 03:51:14-0300 [dmoz.org] INFO: Spider opened
2008-08-20 03:51:14-0300 [dmoz.org] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: <None>)
2008-08-20 03:51:14-0300 [dmoz.org] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: <None>)
2008-08-20 03:51:14-0300 [dmoz.org] INFO: Spider closed (finished)

Pay attention to the lines containing [dmoz.org], which correspond to our spider (identified by the domain "dmoz.org"). You can see a log line for each URL defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is shown at the end of the log line, where it says (referer: <None>).

More interestingly, as our parse method instructs, two files have been created: Books and Resources, with the content of both URLs.
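The file names Books and Resources come from the expression response.url.split("/")[-2] in the parse method. The short sketch below reproduces just that string handling in plain Python, without Scrapy; filename_for is a hypothetical helper name used only here.

```python
def filename_for(url):
    # A URL ending in "/" splits into [..., "Books", ""], so the
    # second-to-last element is the last non-empty path segment.
    return url.split("/")[-2]


START_URLS = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

filenames = [filename_for(url) for url in START_URLS]
print(filenames)  # ['Books', 'Resources']
```

Note that this relies on the trailing slash: for a URL without one, the same expression would return the parent segment instead (here, Python).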

What just happened under the hood?

Scrapy creates scrapy.http.Request objects for each URL in the start_urls attribute of the Spider, and

assigns them the parse method of the spider as their callback function.

These Requests are scheduled, then executed, and scrapy.http.Response objects are returned and then fed

back to the spider, through the parse() method.
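The request/callback flow described above can be modelled with a toy loop in plain Python. This is only a sketch of the idea, not Scrapy's engine: ToyRequest, ToyResponse, fake_download and crawl are hypothetical names, nothing is fetched from the network, and real scheduling in Scrapy is asynchronous.

```python
from collections import deque


class ToyRequest(object):
    """Stand-in for a request: a URL plus the callback that handles it."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class ToyResponse(object):
    """Stand-in for a response: the URL and a body."""

    def __init__(self, url, body):
        self.url = url
        self.body = body


def fake_download(request):
    # No network access: return a canned body for any URL.
    return ToyResponse(request.url, "<html>stub</html>")


def crawl(start_urls, parse):
    # One request per start URL, each with parse as its callback.
    queue = deque(ToyRequest(url, parse) for url in start_urls)
    scraped = []
    while queue:
        request = queue.popleft()
        response = fake_download(request)
        # The callback may return scraped data and/or further requests.
        for result in request.callback(response) or []:
            if isinstance(result, ToyRequest):
                queue.append(result)
            else:
                scraped.append(result)
    return scraped


def parse(response):
    # Return the last path segment as the "scraped" value.
    return [response.url.split("/")[-2]]
```

Calling crawl(["http://example.com/Books/", "http://example.com/Resources/"], parse) returns one value per start URL; a callback that also returned ToyRequest objects would have them queued and processed in turn.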

Extracting Items

Introduction to Selectors

There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath expressions called XPath selectors. For more information about selectors and other extraction mechanisms see the XPath selectors documentation.

Here are some examples of XPath expressions and their meanings:

• /html/head/title: selects the <title> element, inside the <head> element of an HTML document

• /html/head/title/text(): selects the text inside the aforementioned <title> element.

• //td: selects all the <td> elements

• //div[@class="mine"]: selects all div elements which contain an attribute class="mine"


These are just a couple of simple examples of what you can do with XPath, but XPath expressions are indeed much

more powerful. To learn more about XPath we recommend this XPath tutorial.
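To try expressions like the ones listed above without installing anything, the sketch below uses Python's standard xml.etree.ElementTree on a small made-up document. This is an assumption-laden stand-in, not Scrapy's selector API: ElementTree implements only a subset of XPath, paths are written relative to the root <html> element, and text is read through the .text attribute rather than text(). The DOC string is invented for the example.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed document made up for this example.
DOC = """<html>
  <head><title>Example page</title></head>
  <body>
    <div class="mine">first</div>
    <div class="other">second</div>
    <table><tr><td>a</td><td>b</td></tr></table>
  </body>
</html>"""

root = ET.fromstring(DOC)  # root is the <html> element

# Counterpart of /html/head/title (relative to <html>).
title = root.find("head/title")

# Counterpart of /html/head/title/text().
title_text = title.text

# Counterpart of //td: every <td> anywhere below the root.
cells = root.findall(".//td")

# Counterpart of //div[@class="mine"].
mine = root.findall(".//div[@class='mine']")

print(title_text)
print([cell.text for cell in cells])
print([div.text for div in mine])
```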

For working with XPaths, Scrapy provides an XPathSelector class, which comes in two flavours, HtmlXPathSelector (for HTML data) and XmlXPathSelector (for XML data). In order to use them you must instantiate the desired class with a Response object.

You can see selectors as objects that represent nodes in the document structure. So, the first instantiated selectors are associated with the root node, or the entire document.

Selectors have three methods:

• select(): returns a list of selectors, each of them representing the nodes selected by the xpath expression

given as argument.

• extract(): returns a unicode string with the data selected by the XPath selector.

• re(): returns a list of unicode strings extracted by applying the regular expression given as argument.

Trying Selectors in the Shell

To illustrate the use of Selectors we’re going to use the built-in Scrapy shell, which also requires IPython (an extended

Python console) installed on your system.

To start a shell, you must go to the project’s top level directory and run:

scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

This is what the shell looks like:

[ ... Scrapy log here ... ]

[s] Available Scrapy objects:
[s] 2010-08-19 21:45:59-0300 [default] INFO: Spider closed (finished)
[s]   hxs        <HtmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s]   item       Item()
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   spider     <BaseSpider 'default' at 0x1b6c2d0>
[s]   xxs        <XmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s] Useful shortcuts:
[s]   shelp()             Print this help
[s]   fetch(req_or_url)   Fetch a new request or URL and update shell objects
[s]   view(response)      View response in a browser

In [1]:

After the shell loads, you will have the response fetched in a local response variable, so if you type response.body you will see the body of the response, or you can type response.headers to see its headers.

The shell also instantiates two selectors, one for HTML (in the hxs variable) and one for XML (in the xxs variable)

with this response. So let’s try them:

In [1]: hxs.select('/html/head/title')
Out[1]: [<HtmlXPathSelector (title) xpath=/html/head/title>]

In [2]: hxs.select('/html/head/title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: hxs.select('/html/head/title/text()')
Out[3]: [<HtmlXPathSelector (text) xpath=/html/head/title/text()>]

In [4]: hxs.select('/html/head/title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: hxs.select('/html/head/title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
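The pattern passed to re() in the last shell command can be checked on its own with Python's standard re module. The sketch below applies the same regular expression to the title text shown in Out[4]; it uses re.findall directly and is not the selector's re() method.

```python
import re

# The title text from the shell session above.
TITLE = "Open Directory - Computers: Programming: Languages: Python: Books"

# (\w+): captures each run of word characters that is followed by a colon.
words = re.findall(r"(\w+):", TITLE)
print(words)  # ['Computers', 'Programming', 'Languages', 'Python']
```

"Books" is not in the result because it is not followed by a colon, and "Open" and "Directory" are skipped for the same reason.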

Extracting the data

Now, let’s try to extract some real information from those pages.

You could type response.body in the console and inspect the source code to figure out the XPaths you need to use. However, inspecting the raw HTML code there could become a very tedious task. To make this easier, you can use Firefox extensions such as Firebug. For more information see Using Firebug for scraping and Using Firefox for scraping.

After inspecting the page source, you’ll find that the information about the web sites is inside a <ul> element, in fact the second <ul> element.

So we can select each <li> element belonging to the sites list with this code:

hxs.select('//ul/li')

And from them, the sites’ descriptions:

hxs.select('//ul/li/text()').extract()

The sites’ titles:

hxs.select('//ul/li/a/text()').extract()

And the sites’ links:

hxs.select('//ul/li/a/@href').extract()

As we said before, each select() call returns a list of selectors, so we can concatenate further select() calls to

dig deeper into a node. We are going to use that property here, so:

sites = hxs.select('//ul/li')
for site in sites:
    title = site.select('a/text()').extract()
    link = site.select('a/@href').extract()
    desc = site.select('text()').extract()
    print title, link, desc

Note: For a more detailed description of using nested selectors, see Nesting selectors and Working with relative XPaths in the XPath Selectors documentation.
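As a side illustration of relative lookups, the same "select the list items, then query inside each one" pattern can be tried with the standard library alone. The sketch below uses xml.etree.ElementTree on a made-up <ul> fragment; SAMPLE, extract_sites and the example.com links are assumptions for this example, and the code is not Scrapy's selector API.

```python
import xml.etree.ElementTree as ET

# Made-up markup shaped like a list of sites: a link followed by a description.
SAMPLE = """<ul>
  <li><a href="http://example.com/book-a">Book A</a> - a first description</li>
  <li><a href="http://example.com/book-b">Book B</a> - a second description</li>
</ul>"""


def extract_sites(markup):
    sites = []
    # Outer query: every <li> directly under the <ul> root.
    for li in ET.fromstring(markup).findall("li"):
        # Inner queries are relative to the current <li>, not the document.
        anchor = li.find("a")
        sites.append({
            "title": anchor.text,
            "link": anchor.get("href"),
            # The text after </a> is stored as the anchor's tail.
            "desc": (anchor.tail or "").strip(" -\n"),
        })
    return sites


for site in extract_sites(SAMPLE):
    print(site["title"], site["link"], site["desc"])
```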

Let’s add the selector loop shown above to our spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz.org"
