Scrapy English Documentation: Getting Started and Core Concepts Explained

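"The English-language Scrapy documentation is a detailed reference for the Scrapy framework, version 1.0.7, published by the Scrapy developers and dated August 14, 2017. It covers topics from getting started to advanced use and is suited to programmers learning the framework."

Scrapy is an open-source web scraping framework written in Python. It provides powerful features for processing web page data and is widely used in data mining, content scraping, and automated testing. This English-only document is an important reference for Scrapy users and covers the following key topics:

1. **First Steps**: Introduces newcomers to Scrapy with an overview, an installation guide, the Scrapy tutorial, and a set of examples to help beginners get started quickly.

2. **Basic concepts**:
 - **Command line tool**: How to interact with Scrapy from the command line to create projects, run spiders, and so on.
 - **Spiders**: The core component of Scrapy, which defines how pages are crawled and how response data is parsed.
 - **Selectors**: Scrapy uses XPath or CSS selectors to extract and process web page data.
 - **Items**: Define the structure of the data to be scraped, which simplifies later processing.
 - **Item Loaders**: Simplify populating Items with data and allow data cleaning and transformation.
 - **Scrapy Shell**: An interactive command-line tool for testing and debugging selectors.
 - **Item Pipeline**: The processing flow for Items, where data can be cleaned, validated, and stored.
 - **Feed Exports**: Export scraped data to various formats such as CSV and JSON.
 - **Requests and Responses**: A request is an HTTP request sent to a website; a response is the data the server returns.
 - **Link Extractors**: Automatically extract links from web pages so a spider can traverse a site.
 - **Settings**: Configure the global options of a Scrapy project.
 - **Exceptions**: Lists the exceptions that may be encountered in Scrapy and how to handle them.

3. **Built-in services**:
 - **Logging**: Provides logging for tracing and debugging.
 - **Stats Collection**: Collects statistics while a spider runs.
 - **Sending E-mail**: Sends e-mail notifications, for example when a spider finishes or fails.
 - **Telnet Console**: Connect to Scrapy's console through a telnet terminal for live monitoring.
 - **Web Service**: Provides a web service interface for remotely controlling Scrapy spiders.

4. **Solving specific problems**:
 - **FAQ**: Answers to frequently asked questions that come up in practice.
 - **Debugging Spiders**: Debugging techniques for finding errors in spiders.
 - **Spiders Contracts**: A way to check that spiders behave as expected.
 - **Common Practices**: Recommended practices that improve development efficiency and code quality.
 - **Broad Crawls**: Strategies for large-scale crawls.
 - **Using Firefox for Scraping**: Using the Firefox browser to help with scraping.
 - **Using Firebug for Scraping**: Using the Firebug add-on to analyze page structure.
 - **Debugging Memory Leaks**: Detecting and fixing memory leaks.
 - **Downloading and Processing Files and Images**: Handling downloaded files and images, such as image storage and file validation.
 - **Ubuntu Packages**: Installing and managing Scrapy on Ubuntu.
 - **Deploying Spiders**: How to deploy and run Scrapy spiders.
 - **AutoThrottle extension**: Automatically adjusts the request rate to avoid being blocked by the target website.
 - **Benchmarking**: Performance testing to evaluate crawl throughput.
 - **Jobs**: Pausing and resuming crawls.

The document covers every aspect of Scrapy in detail, so both beginners and experienced developers can use it to get a better command of Scrapy and of how to tune it.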

Scrapy Documentation, Release 1.0.7

The parse() method is in charge of processing the response: it parses the response data and returns the scraped data (as Item objects) and more URLs to follow (as Request objects).

This is the code for our first Spider; save it in a file named dmoz_spider.py under the tutorial/spiders directory:

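    import scrapy


    class DmozSpider(scrapy.Spider):
        # Unique name used to launch the spider with "scrapy crawl dmoz".
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        ]

        def parse(self, response):
            # Name the file after the last path segment, e.g. "Books.html".
            filename = response.url.split("/")[-2] + '.html'
            # response.body is bytes, so the file is opened in binary mode.
            with open(filename, 'wb') as f:
                f.write(response.body)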

Crawling

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl dmoz

This command runs the spider named dmoz that we’ve just added, which sends some requests to the dmoz.org domain. You will get output similar to this:

    2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
    2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
    2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
    2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
    2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
    2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
    2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
    2014-01-23 18:13:07-0400 [scrapy] INFO: Spider opened
    2014-01-23 18:13:08-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
    2014-01-23 18:13:09-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
    2014-01-23 18:13:09-0400 [scrapy] INFO: Closing spider (finished)

Note: At the end you can see a log line for each URL defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is shown at the end of the log line, where it says (referer: None).

Now, check the files in the current directory. You should notice two new files have been created: Books.html and Resources.html, with the content for the respective URLs, as our parse method instructs.
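The file names follow from the expression in parse(). The sketch below evaluates that expression outside Scrapy; filename_for is a hypothetical helper introduced only for illustration and is not part of the tutorial or of Scrapy.

```python
# Stand-alone illustration of the filename expression used in parse().
start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]


def filename_for(url):
    # Hypothetical helper (an assumption for this sketch).
    # The URL ends with "/", so split("/") yields an empty last element;
    # index -2 is therefore the final path segment.
    return url.split("/")[-2] + ".html"


filenames = [filename_for(u) for u in start_urls]
print(filenames)
```

Note that the expression relies on the trailing slash: for a URL without one, index -2 would pick the parent segment instead.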


What just happened under the hood?

Scrapy creates scrapy.Request objects for each URL in the start_urls attribute of the Spider, and assigns them the parse method of the spider as their callback function.

These Requests are scheduled, then executed, and scrapy.http.Response objects are returned and then fed back to the spider, through the parse() method.
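The request/callback cycle described above can be pictured with a toy model. This is only a sketch: ToyRequest, ToyResponse, ToySpider, toy_download and toy_crawl are made-up names, no network access happens, and the real Scrapy engine is asynchronous and does not guarantee this ordering.

```python
from collections import deque


class ToyRequest:
    """Stand-in for a request: a URL plus the callback to run on its response."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class ToyResponse:
    """Stand-in for a response: the URL it came from and a body."""

    def __init__(self, url, body):
        self.url = url
        self.body = body


def toy_download(request):
    # Fake downloader: builds a body locally instead of using the network.
    body = b"<html>fake page for " + request.url.encode() + b"</html>"
    return ToyResponse(request.url, body)


class ToySpider:
    start_urls = ["http://example.com/a/", "http://example.com/b/"]

    def __init__(self):
        self.seen = []

    def parse(self, response):
        # Record which responses reached the callback.
        self.seen.append(response.url)


def toy_crawl(spider):
    # 1. One request per start URL, each with spider.parse as its callback.
    queue = deque(ToyRequest(url, spider.parse) for url in spider.start_urls)
    # 2. Requests are taken from the queue and "downloaded" one by one.
    while queue:
        request = queue.popleft()
        response = toy_download(request)
        # 3. The response is fed back to the spider through the callback.
        request.callback(response)


spider = ToySpider()
toy_crawl(spider)
print(spider.seen)
```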

Extracting Items

Introduction to Selectors

There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath or CSS expressions called Scrapy Selectors. For more information about selectors and other extraction mechanisms see the Selectors documentation.

Here are some examples of XPath expressions and their meanings:

• /html/head/title: selects the <title> element, inside the <head> element of an HTML document

• /html/head/title/text(): selects the text inside the aforementioned <title> element.

• //td: selects all the <td> elements

• //div[@class="mine"]: selects all div elements which contain an attribute class="mine"
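To experiment with expressions like these without installing Scrapy, the sketch below uses Python's standard xml.etree.ElementTree module. This is an assumption made for illustration: ElementTree is not what Scrapy uses, it supports only a limited XPath subset (paths are relative to the root element and there is no text() step), and the sample document is invented.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample document.
doc = """<html>
  <head><title>Example page</title></head>
  <body>
    <div class="mine">first</div>
    <div class="other">second</div>
    <table><tr><td>1</td><td>2</td></tr></table>
    <div class="mine">third</div>
  </body>
</html>"""

root = ET.fromstring(doc)  # root is the <html> element

# Roughly /html/head/title: the path is written relative to <html>.
title = root.find("head/title")
# Roughly /html/head/title/text(): ElementTree exposes text as an attribute.
title_text = title.text
# Roughly //td: every <td> anywhere below the root.
cells = [td.text for td in root.findall(".//td")]
# Roughly //div[@class="mine"]: an attribute predicate.
mine = [div.text for div in root.findall(".//div[@class='mine']")]

print(title_text, cells, mine)
```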

These are just a couple of simple examples of what you can do with XPath, but XPath expressions are indeed much more powerful. To learn more about XPath, we recommend this tutorial to learn XPath through examples, and this tutorial to learn “how to think in XPath”.

Note: CSS vs XPath: you can go a long way extracting data from web pages using only CSS selectors. However, XPath offers more power because besides navigating the structure, it can also look at the content: you’re able to select things like: the link that contains the text ‘Next Page’. Because of this, we encourage you to learn about XPath even if you already know how to construct CSS selectors.
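Selecting by content can be illustrated as follows. In XPath 1.0 such a query is typically written with the contains() function, for example //a[contains(text(), "Next Page")]. The sketch below reproduces the same idea with the standard library only; because ElementTree's XPath subset has no contains(), the content test is written as a plain Python filter. The sample markup is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample markup with three links.
doc = """<div>
  <a href="/page/1">Previous Page</a>
  <a href="/page/3">Next Page</a>
  <a href="/about">About</a>
</div>"""

root = ET.fromstring(doc)

# Keep only the links whose text content contains "Next Page".
next_links = [
    a.get("href")
    for a in root.iter("a")
    if "Next Page" in "".join(a.itertext())
]

print(next_links)
```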

For working with CSS and XPath expressions, Scrapy provides the Selector class and convenient shortcuts to avoid instantiating selectors yourself every time you need to select something from a response.

You can see selectors as objects that represent nodes in the document structure. So, the first instantiated selectors are associated with the root node, or the entire document.

Selectors have four basic methods:

• xpath(): returns a list of selectors, each of which represents the nodes selected by the XPath expression given as argument.

• css(): returns a list of selectors, each of which represents the nodes selected by the CSS expression given as argument.

• extract(): returns a unicode string with the selected data.

• re(): returns a list of unicode strings extracted by applying the regular expression given as argument.

Trying Selectors in the Shell

To illustrate the use of Selectors we’re going to use the built-in Scrapy shell, which also requires IPython (an extended Python console) to be installed on your system.


To start a shell, you must go to the project’s top level directory and run:

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

Note: Remember to always enclose URLs in quotes when running the Scrapy shell from the command line, otherwise URLs containing arguments (i.e. the & character) will not work.

This is what the shell looks like:

    [ ... Scrapy log here ... ]

    2014-01-23 17:11:42-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
    [s] Available Scrapy objects:
    [s]   crawler    <scrapy.crawler.Crawler object at 0x3636b50>
    [s]   item       {}
    [s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    [s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    [s]   settings   <scrapy.settings.Settings object at 0x3fadc50>
    [s]   spider     <Spider 'default' at 0x3cebf50>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser

    In [1]:

After the shell loads, you will have the response fetched in a local response variable, so if you type response.body you will see the body of the response, or you can type response.headers to see its headers.

More importantly, response has a selector attribute which is an instance of the Selector class, instantiated with this particular response. You can run queries on response by calling response.selector.xpath() or response.selector.css(). There are also some convenience shortcuts like response.xpath() or response.css() which map directly to response.selector.xpath() and response.selector.css().

So let’s try it:

    In [1]: response.xpath('//title')
    Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

    In [2]: response.xpath('//title').extract()
    Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

    In [3]: response.xpath('//title/text()')
    Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]

    In [4]: response.xpath('//title/text()').extract()
    Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

    In [5]: response.xpath('//title/text()').re('(\w+):')
    Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
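The last call applies a regular expression to the selected text. As a rough check of what the pattern (\w+): captures, the sketch below runs it through Python's standard re module on the title text shown in Out[4]. It assumes only that the pattern is matched against that string; other details of the selector's re() method may differ.

```python
import re

# Title text as returned by the extract() call in the session above.
title_text = "Open Directory - Computers: Programming: Languages: Python: Books"

# Capture every run of word characters that is immediately followed by a colon.
words = re.findall(r"(\w+):", title_text)

print(words)
```

"Directory" and "Books" are not captured because neither is directly followed by a colon.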
