Scrapy 0.23.0: Web Crawling Framework Documentation

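"Official documentation for Scrapy 0.23.0"

Scrapy is a framework for crawling websites and extracting structured data, useful for a wide range of purposes such as data mining, information processing, and historical archival. It was originally designed for screen scraping (more precisely, web scraping), but it can also be used to extract data through APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.

The Scrapy 0.23.0 documentation is a comprehensive guide intended to help users resolve the problems they run into. Its main contents include:

1. Getting help
   - Start with the FAQ, which answers some common questions.
   - For specific information, try the general index or the module index.
   - Search the archives of the Scrapy mailing list, or post a question to the list.
   - Ask in the #scrapy IRC channel for real-time help.
   - Report Scrapy bugs in the issue tracker.

2. First steps
   - Chapter 2, "First steps", introduces Scrapy’s basic concepts. Section 2.1, "Scrapy at a glance", explains that Scrapy is an application framework for crawling websites and extracting structured data that can serve many purposes.

The documentation is clearly organized into chapters, each covering a different aspect of Scrapy in depth. Although only part of the documentation is reproduced here, the full document can be expected to cover topics such as installation, project setup, creating spiders, middlewares, the downloader, selectors, the scheduler, persistent storage, and error handling. It may also cover advanced topics such as debugging, optimizing crawl performance, dealing with anti-scraping measures (for example User-Agent and cookie management), internationalization support, and contributing code to the Scrapy project.

The Scrapy 0.23.0 documentation gives developers a comprehensive learning and reference resource; both beginners and experienced crawler developers can find the information and guidance they need.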

Scrapy Documentation, Release 0.23.0

        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)

Crawling

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl dmoz

The crawl dmoz command runs the spider for the dmoz.org domain. You will get an output similar to this:

2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)

2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...

2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}

2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...

2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...

2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...

2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...

2014-01-23 18:13:07-0400 [dmoz] INFO: Spider opened

2014-01-23 18:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)

2014-01-23 18:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)

2014-01-23 18:13:09-0400 [dmoz] INFO: Closing spider (finished)

Pay attention to the lines containing [dmoz], which correspond to our spider. You can see a log line for each URL defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is shown at the end of the log line, where it says (referer: None).

More interestingly, as our parse method instructs, two files have been created, Books and Resources, each holding the content of the corresponding URL.
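To see why the files end up named Books and Resources, the file-name expression from the parse method can be tried on its own. This is a minimal sketch in plain Python; the helper name filename_for is made up for illustration and is not part of Scrapy.

```python
# The two start URLs used by the tutorial spider.
start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]


def filename_for(url):
    # Hypothetical helper wrapping the expression used in parse().
    # Each URL ends with "/", so the last piece after splitting is an empty
    # string and index -2 is the final path segment.
    return url.split("/")[-2]


filenames = [filename_for(url) for url in start_urls]
print(filenames)  # ['Books', 'Resources']
```

A URL without a trailing slash would give a different segment at index -2, so this naming scheme is specific to URLs shaped like the ones above.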

What just happened under the hood?

Scrapy creates a scrapy.Request object for each URL in the start_urls attribute of the Spider, and assigns the parse method of the spider as its callback function.

These Requests are scheduled, then executed, and the resulting scrapy.http.Response objects are fed back to the spider through the parse() method.
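The flow described above can be pictured with a toy model. The sketch below is not Scrapy’s implementation: Request, Response, fake_download, ToySpider and run are hypothetical stand-ins written in plain Python, and no network access takes place. It only shows the order of events: one request per start URL, each carrying a callback, and each response handed back to that callback.

```python
from collections import deque, namedtuple

# Hypothetical stand-ins for illustration only; these are not Scrapy's classes.
Request = namedtuple("Request", ["url", "callback"])
Response = namedtuple("Response", ["url", "body"])


def fake_download(request):
    # A real downloader would perform an HTTP request; here a body is made up.
    body = b"<html>" + request.url.encode("ascii") + b"</html>"
    return Response(url=request.url, body=body)


class ToySpider(object):
    start_urls = ["http://example.com/a/", "http://example.com/b/"]

    def __init__(self):
        self.seen = []

    def parse(self, response):
        # Called once for every downloaded response.
        self.seen.append(response.url)


def run(spider):
    # 1. One request per start URL, with parse() as the callback.
    queue = deque(Request(url, spider.parse) for url in spider.start_urls)
    # 2. Requests are taken from the queue and executed in turn.
    while queue:
        request = queue.popleft()
        response = fake_download(request)
        # 3. The response is fed back to the spider through the callback.
        request.callback(response)


spider = ToySpider()
run(spider)
print(spider.seen)
```

The real framework downloads asynchronously, so responses need not arrive in the order of start_urls; the log output earlier in this section shows Resources being crawled before Books.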

Extracting Items

Introduction to Selectors

There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath or CSS expressions called Scrapy Selectors. For more information about selectors and other extraction mechanisms see the Selectors documentation.

Here are some examples of XPath expressions and their meanings:

• /html/head/title: selects the <title> element inside the <head> element of an HTML document.

• /html/head/title/text(): selects the text inside the aforementioned <title> element.

• //td: selects all the <td> elements.

• //div[@class="mine"]: selects all <div> elements that have the attribute class="mine".

These are just a couple of simple examples of what you can do with XPath, but XPath expressions are indeed much more powerful. To learn more about XPath we recommend this XPath tutorial.
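The example expressions can be tried without Scrapy by using Python’s standard xml.etree.ElementTree module, which supports a limited subset of XPath. This is only a rough sketch: the document below is made up, ElementTree paths are written relative to an element rather than from the document root, and there is no text() step, so the .text attribute is used instead.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed document used only for this illustration.
doc = ET.fromstring(
    "<html>"
    "<head><title>Example page</title></head>"
    "<body>"
    "<div class='mine'>first</div>"
    "<div class='other'>second</div>"
    "<table><tr><td>1</td><td>2</td></tr></table>"
    "</body>"
    "</html>"
)

# /html/head/title: doc is already the <html> root, so the path starts below it.
title = doc.find("head/title")

# /html/head/title/text(): ElementTree has no text() step; read .text instead.
title_text = title.text

# //td: every <td> element anywhere below the root.
cells = doc.findall(".//td")

# //div[@class="mine"]: <div> elements whose class attribute equals "mine".
mine = doc.findall(".//div[@class='mine']")

print(title_text, len(cells), [d.text for d in mine])
```

Scrapy’s selectors accept full XPath and tolerate real-world HTML, which ElementTree does not, so this sketch is useful for understanding the expressions rather than for scraping.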

For working with XPaths, Scrapy provides the Selector class and convenient shortcuts to avoid instantiating selectors yourself every time you need to select something from a response.

You can see selectors as objects that represent nodes in the document structure. So, the first instantiated selectors are associated with the root node, or the entire document.

Selectors have four basic methods:

• xpath(): returns a list of selectors, each representing a node selected by the XPath expression given as argument.

• css(): returns a list of selectors, each representing a node selected by the CSS expression given as argument.

• extract(): returns a unicode string with the selected data.

• re(): returns a list of unicode strings extracted by applying the regular expression given as argument.

Trying Selectors in the Shell

To illustrate the use of Selectors we’re going to use the built-in Scrapy shell, which also requires IPython (an extended Python console) installed on your system.

To start a shell, you must go to the project’s top level directory and run:

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

Note: Remember to always enclose URLs in quotes when running the Scrapy shell from the command line; otherwise URLs containing arguments (i.e. the & character) will not work.

This is what the shell looks like:

[ ... Scrapy log here ... ]

2014-01-23 17:11:42-0400 [default] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)

[s] Available Scrapy objects:

[s] crawler <scrapy.crawler.Crawler object at 0x3636b50>

[s] item {}

[s] request <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>

[s] response <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>

[s] settings <CrawlerSettings module=None>

[s] spider <Spider 'default' at 0x3cebf50>

[s] Useful shortcuts:

[s] shelp() Shell help (print this help)

[s] fetch(req_or_url) Fetch request (or URL) and update local objects

[s] view(response) View response in a browser

In [1]:

After the shell loads, you will have the response fetched in a local response variable, so if you type response.body you will see the body of the response, or you can type response.headers to see its headers.

More importantly, if you type response.selector you will access a selector object you can use to query the response, along with convenient shortcuts like response.xpath() and response.css(), which map to response.selector.xpath() and response.selector.css().

So let’s try it:

In [1]: response.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

In [2]: response.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: response.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]

In [4]: response.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: response.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
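The regular expression in the last shell command can be checked in isolation with Python’s standard re module. This sketch applies the same pattern to the title text as a plain string; it does not go through Scrapy, so it only illustrates what the pattern itself matches.

```python
import re

# The title text extracted in the shell session above.
title = "Open Directory - Computers: Programming: Languages: Python: Books"

# Each match is a run of word characters immediately followed by a colon.
# The parentheses form a group, so findall() returns the word without the colon.
words = re.findall(r"(\w+):", title)
print(words)  # ['Computers', 'Programming', 'Languages', 'Python']
```

"Open", "Directory" and "Books" are not returned because none of them is directly followed by a colon.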

Extracting the data

Now, let’s try to extract some real information from those pages.

You could type response.body in the console, and inspect the source code to figure out the XPaths you need to use. However, inspecting the raw HTML code there could become a very tedious task. To make this an easier task, you can use some Firefox extensions like Firebug. For more information see Using Firebug for scraping and Using Firefox for scraping.

After inspecting the page source, you’ll find that the websites’ information is inside a <ul> element, in fact the second <ul> element.

So we can select each <li> element belonging to the sites list with this code:

response.xpath('//ul/li')

And from them, the sites’ descriptions:

response.xpath('//ul/li/text()').extract()

The sites’ titles:

response.xpath('//ul/li/a/text()').extract()

And the sites’ links:

response.xpath('//ul/li/a/@href').extract()

As we’ve said before, each .xpath() call returns a list of selectors, so we can concatenate further .xpath() calls to dig deeper into a node. We are going to use that property here, so:

for sel in response.xpath('//ul/li'):
    title = sel.xpath('a/text()').extract()
    link = sel.xpath('a/@href').extract()
    desc = sel.xpath('text()').extract()
    print title, link, desc
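The idea of selecting each list item first and then querying relative to it can also be sketched with the standard library. The markup below is invented to mimic the shape of the directory page (a first <ul> that is not the sites list, then the sites list), and ElementTree stands in for Scrapy’s selectors, so the calls differ from the loop above.

```python
import xml.etree.ElementTree as ET

# Made-up markup that mimics the shape of the directory listing.
page = ET.fromstring(
    "<body>"
    "<ul><li><a href='/about'>About</a></li></ul>"
    "<ul>"
    "<li><a href='http://example.com/one'>Book One</a> - first description</li>"
    "<li><a href='http://example.com/two'>Book Two</a> - second description</li>"
    "</ul>"
    "</body>"
)

rows = []
for li in page.findall(".//ul/li"):
    # Queries on li are relative to this <li>, like sel.xpath('a/...') above.
    a = li.find("a")
    title = a.text
    link = a.get("href")
    # ElementTree stores the text that follows </a> as the tail of <a>.
    desc = (a.tail or "").strip()
    rows.append((title, link, desc))

for row in rows:
    print(row)
```

Note that the expression matches the <li> elements of every <ul>, including the first one, so a real spider may need a narrower expression to pick up only the sites list.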

Note: For a more detailed description of using nested selectors, see Nesting selectors and Working with relative


Note: You can find a fully-functional variant of this spider in the dirbot project available at https://github.com/scrapy/dirbot

Now doing a crawl on the dmoz.org domain yields DmozItem objects:

[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
     {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n'],
      'link': [u'http://gnosis.cx/TPiP/'],
      'title': [u'Text Processing in Python']}

[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
     {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
      'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
      'title': [u'XML Processing with Python']}

2.3.4 Storing the scraped data

The simplest way to store the scraped data is by using the Feed exports, with the following command:

scrapy crawl dmoz -o items.json -t json

That will generate an items.json file containing all scraped items, serialized in JSON.
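Once such a file exists, it can be read back with Python’s standard json module. The sketch below assumes the file holds a single JSON array of objects, one per item; that layout is an assumption here, not something shown in this excerpt. To stay self-contained it first writes a small stand-in file to a temporary directory, using two titles and links copied from the sample output above.

```python
import json
import os
import tempfile

# Sample items shaped like the tutorial output; values are copied from it.
items = [
    {"title": ["Text Processing in Python"],
     "link": ["http://gnosis.cx/TPiP/"]},
    {"title": ["XML Processing with Python"],
     "link": ["http://www.informit.com/store/product.aspx?isbn=0130211192"]},
]

# Stand-in for the file a crawl would produce (assumed: one JSON array).
path = os.path.join(tempfile.mkdtemp(), "items.json")
with open(path, "w") as f:
    json.dump(items, f)

# Reading the file back gives a list of dictionaries, one per scraped item.
with open(path) as f:
    loaded = json.load(f)

# Each field holds a list because extract() returns a list of strings.
titles = [item["title"][0] for item in loaded]
print(titles)
```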

In small projects (like the one in this tutorial), that should be enough. However, if you want to do more complex things with the scraped items, you can write an Item Pipeline. As with Items, a placeholder file for Item Pipelines was set up for you when the project was created, in tutorial/pipelines.py. You don’t need to implement any item pipelines if you just want to store the scraped items.
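As a rough idea of what could go into that placeholder file, here is a minimal sketch of a pipeline that trims whitespace from every extracted string. The class name StripWhitespacePipeline and the sample values are hypothetical; the sketch relies only on the convention that a pipeline is a plain class with a process_item(item, spider) method that returns the item. To take effect in a project it would also have to be enabled through the ITEM_PIPELINES setting.

```python
class StripWhitespacePipeline(object):
    """Hypothetical pipeline: trims whitespace from every extracted string."""

    def process_item(self, item, spider):
        # Items in the tutorial map each field name to a list of strings.
        for field in list(item.keys()):
            item[field] = [value.strip() for value in item[field]]
        # Returning the item passes it on to the next pipeline stage.
        return item


# Exercising the method directly, with a plain dict standing in for an item.
sample = {
    "title": ["  XML Processing with Python "],
    "desc": [" - By Sean McGrath\n"],
}
cleaned = StripWhitespacePipeline().process_item(sample, spider=None)
print(cleaned)
```

Calling process_item by hand like this is only a way to try the logic; during a crawl the framework calls it for every scraped item.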

2.3.5 Next steps

This tutorial covers only the basics of Scrapy, and there are many other features not mentioned here. Check the What else? section in the Scrapy at a glance chapter for a quick overview of the most important ones.

Then, we recommend you continue by playing with an example project (see Examples), and then continue with the section Basic concepts.

2.4 Examples

The best way to learn is with examples, and Scrapy is no exception. For this reason, there is an example Scrapy project named dirbot, that you can use to play and learn more about Scrapy. It contains the dmoz spider described in the tutorial.

This dirbot project is available at: https://github.com/scrapy/dirbot

It contains a README file with a detailed description of the project contents.

If you’re familiar with git, you can check out the code. Otherwise you can download a tarball or zip file of the project by clicking on Downloads.

The scrapy tag on Snipplr is used for sharing code snippets such as spiders, middlewares, extensions, or scripts. Feel free (and encouraged!) to share any code there.

Scrapy at a glance: Understand what Scrapy is and how it can help you.

Installation guide: Get Scrapy installed on your computer.
