Scrapy Framework Official Documentation: From Beginner to Mastery

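"Scrapy.pdf is a high-definition electronic copy of the documentation for the Scrapy framework. It covers Scrapy's basic concepts, installation guide, tutorial, built-in services, and approaches to solving specific problems, and is aimed at Python developers learning web crawler development."

Scrapy is an efficient and powerful web crawling framework written in Python. It provides many out-of-the-box features, such as crawling, parsing web pages, and processing data. This document is the Release 1.6.0 version, published by the Scrapy developers in 2019.

The first part of the document explains how beginners can get started with Scrapy. First, the Scrapy overview gives a quick introduction to its core concepts. Next, the installation guide describes the steps for installing Scrapy on different operating systems. The Scrapy tutorial then walks you through creating and running your first crawler project. The document also includes several examples to help readers understand and practice using Scrapy.

The basic concepts chapter covers the command line tool, which is the main way of interacting with Scrapy. Spiders are Scrapy's core component and define the crawling rules and the data parsing logic. Selectors use XPath and CSS syntax to extract data from HTML or XML documents. Items represent the structure of the data you want to scrape, and ItemLoaders make it convenient to populate Items with data. The Scrapy Shell provides an interactive environment for testing and debugging selectors and parsing logic. The Item Pipeline is Scrapy's data processing flow and is responsible for cleaning, validating, and storing the scraped data. Feed Exports can export the crawl results in various formats. The Requests and Responses section explains the network request and response objects, which are the basis on which a spider obtains page data. Link Extractors automatically extract links from web pages, which makes deep crawling easier. The Settings section describes how to customize the configuration of a Scrapy project. Finally, the exceptions section describes common errors and exceptions in Scrapy.

The built-in services chapter covers logging, stats collection, sending e-mail, the telnet console, and the web service. These are convenience tools that Scrapy provides to help developers monitor and control a running crawler.

In the part on solving specific problems, the document provides an FAQ, methods for debugging spiders, Spider Contracts (for checking that spiders behave consistently), common practices, broad crawls, using browser developer tools to assist scraping, debugging memory leaks, downloading and processing files and images, deploying spiders, the AutoThrottle extension (which adjusts the request rate automatically), benchmarking, and the Jobs feature for pausing and resuming crawls. This material is intended to help developers solve problems they meet in real projects.

With this detailed documentation, Python developers can gain a thorough understanding of the Scrapy framework and build and maintain their own web crawler projects effectively.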

Scrapy Documentation, Release 1.6.0

You can also take a look at this list of Python resources for non-programmers, as well as the suggested resources in the learnpython-subreddit.

2.3.1 Creating a project

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

scrapy startproject tutorial

This will create a tutorial directory with the following contents:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

2.3.2 Our first Spider

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.

This is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory in your project:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:

• name: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.

• start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

• parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.

  The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.
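The two string operations that parse() uses to name its output file can be tried outside Scrapy. The following is a minimal sketch in plain Python; the helper name page_filename is hypothetical and is not part of Scrapy.

```python
def page_filename(url):
    """Build the output filename the same way the tutorial's parse() does."""
    # 'http://quotes.toscrape.com/page/1/'.split("/") ends with
    # [..., 'page', '1', ''], so index -2 is the page number: the trailing
    # slash leaves an empty string as the last element.
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


for url in ('http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/'):
    print(url, '->', page_filename(url))
```

Note that this indexing relies on the trailing slash. For a URL written without it, index -2 would pick up the path segment before the page number instead.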

How to run our spider

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl quotes

This command runs the spider named quotes that we’ve just added, which will send some requests to the quotes.toscrape.com domain. You will get an output similar to this:

... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...

Now, check the files in the current directory. You should notice that two new files have been created: quotes-1.html and quotes-2.html, with the content for the respective URLs, as our parse method instructs.

Note: If you are wondering why we haven’t parsed the HTML yet, hold on, we will cover that soon.

What just happened under the hood?

Scrapy schedules the scrapy.Request objects returned by the start_requests method of the Spider. Upon receiving a response for each one, it instantiates Response objects and calls the callback method associated with the request (in this case, the parse method) passing the response as argument.
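The flow just described (schedule requests, receive a response for each, call the request's callback with it) can be modelled in a few lines of plain Python. This is only a toy sketch and not Scrapy's engine: FakeRequest, FakeResponse, fake_download, and run are hypothetical names, nothing is downloaded, and requests are processed strictly one after another.

```python
from collections import deque


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: a URL plus a callback."""
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class FakeResponse:
    """Hypothetical stand-in for a Response: the URL and a body."""
    def __init__(self, url, body):
        self.url = url
        self.body = body


def fake_download(request):
    # No network access: fabricate a body so the flow can be followed offline.
    return FakeResponse(request.url, b'<html>' + request.url.encode() + b'</html>')


def run(start_requests):
    """Queue every request, then hand each response to its callback."""
    queue = deque(start_requests)                    # "schedule" the requests
    results = []
    while queue:
        request = queue.popleft()
        response = fake_download(request)            # a response arrives
        results.append(request.callback(response))   # call the associated callback
    return results


def parse(response):
    return 'parsed %s (%d bytes)' % (response.url, len(response.body))


urls = ['http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/']
for line in run(FakeRequest(url, parse) for url in urls):
    print(line)
```

Real Scrapy downloads asynchronously, so callbacks are not guaranteed to run in the order the requests were created; the sequential loop here is a simplification.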

A shortcut to the start_requests method

Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can just define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your spider:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

The parse() method will be called to handle each of the requests for those URLs, even though we haven’t explicitly told Scrapy to do so. This happens because parse() is Scrapy’s default callback method, which is called for requests without an explicitly assigned callback.
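The two defaults at work here (requests built from start_urls, and parse() used when a request carries no callback) can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not Scrapy's source code: MiniRequest, MiniSpider, and callback_for are hypothetical names, and parse() receives a URL string rather than a real response.

```python
class MiniRequest:
    """Hypothetical stand-in for scrapy.Request."""
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback  # None means "no explicit callback"


class MiniSpider:
    """Hypothetical base class modelling the two defaults described above."""
    start_urls = []

    def start_requests(self):
        # Default behaviour: one request per start URL, with no callback set.
        for url in self.start_urls:
            yield MiniRequest(url)

    def parse(self, response_url):
        raise NotImplementedError


def callback_for(spider, request):
    # Requests without an explicit callback fall back to spider.parse.
    return request.callback or spider.parse


class QuotesSpider(MiniSpider):
    start_urls = ['http://quotes.toscrape.com/page/1/',
                  'http://quotes.toscrape.com/page/2/']

    def parse(self, response_url):
        return 'parse() handled %s' % response_url


spider = QuotesSpider()
handled = [callback_for(spider, r)(r.url) for r in spider.start_requests()]
print(handled)
```

The subclass only supplies start_urls and parse(); everything else comes from the base class, which mirrors how the shortened spider above works.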

Extracting data

The best way to learn how to extract data with Scrapy is to try selectors using the Scrapy shell. Run:

scrapy shell 'http://quotes.toscrape.com/page/1/'

Note: Remember to always enclose URLs in quotes when running the Scrapy shell from the command line, otherwise URLs containing arguments (i.e. the & character) will not work.

On Windows, use double quotes instead:

scrapy shell "http://quotes.toscrape.com/page/1/"
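When the command line is assembled by a script rather than typed by hand, Python's standard shlex.quote can add the quoting for POSIX shells. The sketch below only builds and prints the command strings; the URL with query arguments is a made-up example, not a page used in this tutorial.

```python
import shlex

plain_url = 'http://quotes.toscrape.com/page/1/'
# Hypothetical URL with query arguments, used only to show the effect of "&".
query_url = 'http://quotes.toscrape.com/search?tag=love&page=2'

commands = []
for url in (plain_url, query_url):
    # shlex.quote adds single quotes only when the string contains characters
    # that a POSIX shell would treat specially (such as "&" or "?").
    commands.append('scrapy shell ' + shlex.quote(url))

for command in commands:
    print(command)
```

shlex.quote targets POSIX shells, so it does not produce the double-quoted form needed on Windows.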

You will see something like:

[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>

Using the shell, you can try selecting elements using CSS with the response object:

>>> response.css('title')

[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.

To extract the text from the title above, you can do:

>>> response.css('title::text').getall()

['Quotes to Scrape']

There are two things to note here: one is that we’ve added ::text to the CSS query, to mean we want to select only the text elements directly inside the <title> element. If we don’t specify ::text, we’d get the full title element, including its tags:

>>> response.css('title').getall()

['<title>Quotes to Scrape</title>']

The other thing is that the result of calling .getall() is a list: it is possible that a selector returns more than one result, so we extract them all. When you know you just want the first result, as in this case, you can do:

>>> response.css('title::text').get()

'Quotes to Scrape'

As an alternative, you could’ve written:

>>> response.css('title::text')[0].get()

'Quotes to Scrape'

However, using .get() directly on a SelectorList instance avoids an IndexError and returns None when it doesn’t find any element matching the selection.

There’s a lesson here: for most scraping code, you want it to be resilient to errors due to things not being found on a page, so that even if some parts fail to be scraped, you can at least get some data.
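The difference between indexing and a "first or nothing" lookup can be reproduced with plain Python lists standing in for selection results. This is a minimal sketch, not Scrapy code: first_or_default is a hypothetical helper, and the lists are stand-ins for a SelectorList.

```python
def first_or_default(matches, default=None):
    """Return the first match, or `default` when nothing matched."""
    return matches[0] if matches else default


found = ['Quotes to Scrape']   # stands in for a selection that matched
missing = []                   # stands in for a selection that matched nothing

print(first_or_default(found))
print(first_or_default(missing))

try:
    missing[0]                 # the indexing style fails on an empty result
except IndexError as exc:
    print('IndexError:', exc)

# A record can still be built when some fields are absent from the page.
record = {'title': first_or_default(found),
          'subtitle': first_or_default(missing, default='')}
print(record)
```

Building records this way means a missing field produces a placeholder value instead of aborting the whole callback.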

Besides the getall() and get() methods, you can also use the re() method to extract using regular expressions:

>>> response.css('title::text').re(r'Quotes.*')

['Quotes to Scrape']

>>> response.css('title::text').re(r'Q\w+')

['Quotes']

>>> response.css('title::text').re(r'(\w+) to (\w+)')

['Quotes', 'Scrape']
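The same regular expressions can be explored with Python's standard re module on the plain title string, which is handy for testing a pattern before using it in a selector. This sketch uses re.findall only and makes no claim about how the selector's re() method is implemented; it just shows how the patterns behave.

```python
import re

title = 'Quotes to Scrape'

# No groups: findall returns the whole match.
whole = re.findall(r'Quotes.*', title)
word = re.findall(r'Q\w+', title)
print(whole)
print(word)

# Two groups: findall returns one tuple per match...
pairs = re.findall(r'(\w+) to (\w+)', title)
print(pairs)

# ...which can be flattened into a single list of strings, the same shape as
# the shell output for the grouped pattern.
flat = [group for match in pairs for group in match]
print(flat)
```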

In order to find the proper CSS selectors to use, you might find it useful to open the response page from the shell in your web browser using view(response). You can use your browser developer tools to inspect the HTML and come up with a selector (see section about Using your browser’s Developer Tools for scraping).

Selector Gadget is also a nice tool to quickly find CSS selectors for visually selected elements, which works in many browsers.

XPath: a brief intro

Besides CSS, Scrapy selectors also support using XPath expressions:

>>> response.xpath('//title')

[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]

>>> response.xpath('//title/text()').get()

'Quotes to Scrape'

XPath expressions are very powerful, and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under-the-hood. You can see that if you read closely the text representation of the selector objects in the shell.

While perhaps not as popular as CSS selectors, XPath expressions offer more power because, besides navigating the structure, they can also look at the content. Using XPath, you’re able to select things like: select the link that contains the text “Next Page”. This makes XPath very fitting to the task of scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors. It will make scraping much easier.
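Selecting by content can be tried with the limited XPath subset in Python's standard xml.etree.ElementTree. The sketch below is not Scrapy code, and the pager markup, hrefs, and link texts are made up for the example. ElementTree only supports exact text comparison; with full XPath support, as in Scrapy selectors, an expression such as //a[contains(text(), "Next Page")] would express the "contains" case.

```python
import xml.etree.ElementTree as ET

# Hypothetical pager markup; the hrefs and link texts are made up.
snippet = """
<ul class="pager">
  <li><a href="/page/1/">Previous Page</a></li>
  <li><a href="/page/3/">Next Page</a></li>
</ul>
""".strip()

root = ET.fromstring(snippet)

# [.='Next Page'] keeps only <a> elements whose complete text equals the string.
next_links = root.findall(".//a[.='Next Page']")
hrefs = [a.get('href') for a in next_links]
print(hrefs)
```

A CSS selector can pick out links by tag, class, or attribute, but standard CSS has no way to match on the link text itself, which is the extra power referred to above.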

We won’t cover much of XPath here, but you can read more about using XPath with Scrapy Selectors here. To learn more about XPath, we recommend this tutorial to learn XPath through examples, and this tutorial to learn “how to think in XPath”.

Extracting quotes and authors

Now that you know a bit about selection and extraction, let’s complete our spider by writing the code to extract the quotes from the web page.

Each quote in http://quotes.toscrape.com is represented by HTML elements that look like this:

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Let’s open up scrapy shell and play a bit to find out how to extract the data we want:

$ scrapy shell 'http://quotes.toscrape.com'
