A Detailed Guide to Scrapy, the Python Crawling Framework

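PDF, 1.07 MB, updated 2024-07-18

"Scrapy is a crawling framework written in Python. Its design was inspired by Django, and it is both highly flexible and powerful. Scrapy's documentation is extensive: a beginner's guide, an installation guide, and introductions to the basic concepts, including the command line tool, spiders, selectors, Items, Item Loaders, the Scrapy shell, the Item Pipeline, feed exports, requests and responses, link extractors, settings, and exceptions. Scrapy also has built-in services for logging, stats collection, sending email, a Telnet console, and a web service. For solving specific problems, the documentation covers frequently asked questions, debugging spiders, spider contracts, common practices, broad crawls, using Firefox and Firebug for scraping, debugging memory leaks, downloading and processing files and images, deploying spiders, the AutoThrottle extension, benchmarking, and pausing and resuming jobs."

Scrapy is a powerful framework for building web scraping projects. Its core components include:

1. **Command line tool**: Scrapy provides a set of commands for creating a project, generating the file structure, starting spiders, and so on.
2. **Spiders**: the core of Scrapy. A spider defines how pages are crawled and how data is extracted from them. Developers write their own spider classes to implement a crawling strategy tailored to the target site.
3. **Selectors**: Scrapy uses XPath or CSS expressions to select parts of HTML and XML documents, which makes extracting the required information convenient and efficient.
4. **Items**: define the structure of the scraped data. They behave much like Python dicts and map easily onto a database or other persistent storage.
5. **Item Loaders**: populate Items. They work together with selectors to place the data extracted from a page into Items.
6. **Scrapy shell**: an interactive environment in which developers can test and debug selectors at runtime.
7. **Item Pipeline**: processes the data produced by spiders. It can clean, validate, and transform the data, and also save it to a database.
8. **Feed exports**: Scrapy can export the scraped data in several formats, such as JSON, XML, and CSV.
9. **Requests and Responses**: a request is an HTTP request issued by Scrapy; a response is the data the server sends back.
10. **Link Extractors**: automatically discover links in a page so that the crawl can go deeper.
11. **Settings**: configure Scrapy's behavior, such as crawl speed, middlewares, and the downloader.
12. **Exceptions**: handle the various errors that can occur during a crawl.

Scrapy also provides built-in services that extend its functionality and make it easier to use, for example:

- **Logging**: records events and errors while a spider runs.
- **Stats Collection**: collects statistics while a spider runs.
- **Sending email**: sends email notifications under specific conditions.
- **Telnet Console**: controls a running spider remotely over a telnet interface.
- **Web Service**: provides a web interface for monitoring and controlling spiders.

Scrapy suits beginners and also meets the complex needs of advanced users. By learning and practicing Scrapy, developers can build efficient, reliable web crawlers that automate the collection and processing of data.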

Scrapy Documentation, Release 1.3.0

This is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory in your project:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:

• name: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.

• start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

• parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.

The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.
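The filename logic used in parse() can be checked outside Scrapy. The following is a minimal sketch in plain Python; the helper name page_filename is hypothetical (it is not part of Scrapy) and only mirrors the two lines from the spider that build the filename.

```python
def page_filename(url):
    """Mirror the spider's parse() logic: use the second-to-last URL segment."""
    # 'http://quotes.toscrape.com/page/1/'.split("/") ends with [..., 'page', '1', '']
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


print(page_filename('http://quotes.toscrape.com/page/1/'))  # quotes-1.html
print(page_filename('http://quotes.toscrape.com/page/1'))   # quotes-page.html
```

The second call shows that the [-2] index relies on the trailing slash in the URL: without it, the page number is the last segment and the helper picks up 'page' instead.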

How to run our spider

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl quotes

This command runs the spider named quotes that we’ve just added, which will send some requests to the quotes.toscrape.com domain. You will get an output similar to this:

... (omitted for brevity)

2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened

2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023

2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)

2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)

2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)

2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html

2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html

2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)

...

Now, check the files in the current directory. You should notice that two new files have been created: quotes-1.html and quotes-2.html, with the content for the respective URLs, as our parse method instructs.

Note: If you are wondering why we haven’t parsed the HTML yet, hold on, we will cover that soon.

What just happened under the hood?

Scrapy schedules the scrapy.Request objects returned by the start_requests method of the Spider. Upon receiving a response for each one, it instantiates Response objects and calls the callback method associated with the request (in this case, the parse method), passing the response as an argument.
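To make the schedule, download, and callback cycle concrete, here is a deliberately simplified sketch in plain Python. It is not how Scrapy is implemented (the real engine is asynchronous); Request, Response, fake_download and run are stand-ins invented for this illustration, and nothing is fetched from the network.

```python
from collections import deque, namedtuple

# Stand-ins for scrapy.Request and the response object (assumptions of this sketch).
Request = namedtuple('Request', ['url', 'callback'])
Response = namedtuple('Response', ['url', 'body'])


def fake_download(request):
    """Pretend to fetch the URL; a real engine would hand this to the downloader."""
    return Response(url=request.url, body=b'<html>%s</html>' % request.url.encode())


def run(start_requests):
    """Schedule the initial requests, then call each request's callback with its response."""
    queue = deque(start_requests)
    results = []
    while queue:
        request = queue.popleft()
        response = fake_download(request)
        # A callback may return nothing, or an iterable of items and further Requests.
        for output in request.callback(response) or []:
            if isinstance(output, Request):
                queue.append(output)    # follow-up request: schedule it
            else:
                results.append(output)  # scraped item: collect it
    return results


def parse(response):
    yield {'url': response.url, 'size': len(response.body)}


items = run([
    Request('http://quotes.toscrape.com/page/1/', parse),
    Request('http://quotes.toscrape.com/page/2/', parse),
])
print(items)
```

The sketch processes requests one at a time, in order, which keeps the control flow easy to follow.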

A shortcut to the start_requests method

Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can just define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your spider:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

The parse() method will be called to handle each of the requests for those URLs, even though we haven’t explicitly told Scrapy to do so. This happens because parse() is Scrapy’s default callback method, which is called for requests without an explicitly assigned callback.
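The two defaults described here (requests built from start_urls, and parse() used when no callback is assigned) can be imitated in a few lines of plain Python. This is only a sketch of the described behaviour, not Scrapy's source code: SketchRequest, SketchSpider, QuotesSketch and resolve_callback are hypothetical names.

```python
class SketchRequest:
    """Stand-in for scrapy.Request; callback=None means "use the default callback"."""

    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback


class SketchSpider:
    """Hypothetical base class imitating the defaults described in the text."""

    start_urls = []

    def start_requests(self):
        # Default implementation: one request per URL, with no explicit callback.
        for url in self.start_urls:
            yield SketchRequest(url)

    def parse(self, response):
        raise NotImplementedError


def resolve_callback(spider, request):
    """Fall back to spider.parse when the request has no callback assigned."""
    return request.callback or spider.parse


class QuotesSketch(SketchSpider):
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        return 'parsed %s' % response


spider = QuotesSketch()
for request in spider.start_requests():
    callback = resolve_callback(spider, request)
    # The URL string stands in for a downloaded response here.
    print(callback(request.url))
```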

Extracting data

The best way to learn how to extract data with Scrapy is to try selectors using the Scrapy shell. Run:

scrapy shell 'http://quotes.toscrape.com/page/1/'

Note: Remember to always enclose URLs in quotes when running the Scrapy shell from the command line, otherwise URLs containing arguments (i.e. the & character) will not work.

On Windows, use double quotes instead:

scrapy shell "http://quotes.toscrape.com/page/1/"
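When the command line is assembled by a script rather than typed by hand, the quoting can be delegated to Python's standard library. The sketch below assumes a POSIX shell (shlex.quote is not meant for the Windows command prompt), and the query string in the URL is made up for the example.

```python
import shlex

# Hypothetical URL with a query string; the & would otherwise be read by the shell.
url = 'http://quotes.toscrape.com/page/1/?a=1&b=2'

command = 'scrapy shell ' + shlex.quote(url)
print(command)
# scrapy shell 'http://quotes.toscrape.com/page/1/?a=1&b=2'

# shlex.split() tokenizes the line with POSIX shell quoting rules,
# so the quoted URL comes back as a single argument.
print(shlex.split(command))
```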

You will see something like:

[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>

Using the shell, you can try selecting elements using CSS with the response object:

>>> response.css('title')

[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.

To extract the text from the title above, you can do:

>>> response.css('title::text').extract()

['Quotes to Scrape']

There are two things to note here: one is that we’ve added ::text to the CSS query, to mean we want to select only the text elements directly inside the <title> element. If we don’t specify ::text, we’d get the full title element, including its tags:

>>> response.css('title').extract()

['<title>Quotes to Scrape</title>']

The other thing is that the result of calling .extract() is a list, because we’re dealing with an instance of SelectorList. When you know you just want the first result, as in this case, you can do:

>>> response.css('title::text').extract_first()

'Quotes to Scrape'

As an alternative, you could’ve written:

>>> response.css('title::text')[0].extract()

'Quotes to Scrape'

However, using .extract_first() avoids an IndexError and returns None when it doesn’t find any element matching the selection.

There’s a lesson here: for most scraping code, you want it to be resilient to errors due to things not being found on a page, so that even if some parts fail to be scraped, you can at least get some data.
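The difference between the two styles can be reproduced with ordinary lists. The sketch below is plain Python, not Scrapy code: first_or_none is a hypothetical helper that behaves the way the text describes .extract_first(), and the lists stand in for the results of .extract().

```python
def first_or_none(values, default=None):
    """Return the first element of values, or default when values is empty."""
    for value in values:
        return value
    return default


titles = ['Quotes to Scrape']  # what .extract() returned for the title query
missing = []                   # what a query with no matches would return

print(first_or_none(titles))   # Quotes to Scrape
print(first_or_none(missing))  # None

try:
    missing[0]
except IndexError as exc:
    print('indexing an empty result raises:', exc)
```

Returning None lets the rest of a callback carry on and still emit the fields that were found, which is the resilience the paragraph above recommends.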

Besides the extract() and extract_first() methods, you can also use the re() method to extract using regular expressions:

>>> response.css('title::text').re(r'Quotes.*')

['Quotes to Scrape']

>>> response.css('title::text').re(r'Q\w+')

['Quotes']

>>> response.css('title::text').re(r'(\w+) to (\w+)')

['Quotes', 'Scrape']
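The three results above follow a simple pattern: without capture groups the whole match is returned, and with groups the captured pieces are returned as one flat list. The sketch below approximates that observable behaviour with Python's re module. regex_extract is a hypothetical helper written for this illustration, not Scrapy's implementation of re().

```python
import re


def regex_extract(pattern, text):
    """Return whole matches if the pattern has no groups, else the flattened groups."""
    results = []
    for match in re.findall(pattern, text):
        if isinstance(match, tuple):
            results.extend(match)   # several groups: findall yields tuples
        else:
            results.append(match)   # zero or one group: findall yields strings
    return results


title = 'Quotes to Scrape'
print(regex_extract(r'Quotes.*', title))         # ['Quotes to Scrape']
print(regex_extract(r'Q\w+', title))             # ['Quotes']
print(regex_extract(r'(\w+) to (\w+)', title))   # ['Quotes', 'Scrape']
```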

In order to find the proper CSS selectors to use, you might find it useful to open the response page from the shell in your web browser using view(response). You can use your browser developer tools or extensions like Firebug (see sections about Using Firebug for scraping and Using Firefox for scraping).

Selector Gadget is also a nice tool to quickly find the CSS selector for visually selected elements, and it works in many browsers.

XPath: a brief intro

Besides CSS, Scrapy selectors also support using XPath expressions:

>>> response.xpath('//title')

[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]

>>> response.xpath('//title/text()').extract_first()

'Quotes to Scrape'

XPath expressions are very powerful, and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under-the-hood. You can see that if you look closely at the text representation of the selector objects in the shell.

While perhaps not as popular as CSS selectors, XPath expressions offer more power because, besides navigating the structure, they can also look at the content. Using XPath, you’re able to select things like: the link that contains the text “Next Page”. This makes XPath very fitting to the task of scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors; it will make scraping much easier.
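Selecting by content can be tried without Scrapy, using the limited XPath subset in Python's standard library (the [.='text'] predicate needs Python 3.7 or later). The markup below is a hypothetical, well-formed pager invented for the example, not copied from quotes.toscrape.com. ElementTree matches the complete text exactly, whereas a full XPath engine such as the one behind Scrapy selectors also offers functions like contains().

```python
import xml.etree.ElementTree as ET

# Hypothetical pager markup, kept well-formed so the XML parser accepts it.
PAGER = (
    '<ul class="pager">'
    '<li><a href="/page/1/">Previous Page</a></li>'
    '<li><a href="/page/3/">Next Page</a></li>'
    '</ul>'
)

root = ET.fromstring(PAGER)

# The predicate looks at the link's text content, not at its position or attributes.
next_link = root.find(".//a[.='Next Page']")
print(next_link.get('href'))  # /page/3/
```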

We won’t cover much of XPath here, but you can read more about using XPath with Scrapy Selectors here. To learn

more about XPath, we recommend this tutorial to learn XPath through examples, and this tutorial to learn “how to

think in XPath”.

Extracting quotes and authors

Now that you know a bit about selection and extraction, let’s complete our spider by writing the code to extract the quotes from the web page.

Each quote in http://quotes.toscrape.com is represented by HTML elements that look like this:

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Let’s open up scrapy shell and play a bit to find out how to extract the data we want:

$ scrapy shell 'http://quotes.toscrape.com'

We get a list of selectors for the quote HTML elements with:

>>> response.css("div.quote")

Each of the selectors returned by the query above allows us to run further queries over their sub-elements. Let’s assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote:

>>> quote = response.css("div.quote")[0]

Now, let’s extract title, author and the tags from that quote using the quote object we just created:

>>> title = quote.css("span.text::text").extract_first()

>>> title

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

>>> author = quote.css("small.author::text").extract_first()

>>> author

'Albert Einstein'

Given that the tags are a list of strings, we can use the .extract() method to get all of them:

>>> tags = quote.css("div.tags a.tag::text").extract()

>>> tags

['change', 'deep-thoughts', 'thinking', 'world']
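For readers without a Scrapy shell at hand, the same three lookups can be mirrored with the standard library on a small, well-formed fragment. This is only a sketch. The fragment is a hand-written stand-in shaped like the quote markup, using the J.K. Rowling quote from the site. The ElementTree paths are rough equivalents of the CSS queries, not what Scrapy generates: the predicate [@class='tag'] compares the whole attribute value, while the CSS selector a.tag matches any element whose class list contains "tag".

```python
import xml.etree.ElementTree as ET

# Hand-written fragment shaped like one quote block (assumption of this sketch).
FRAGMENT = '''<div class="quote">
  <span class="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
  <span>by <small class="author">J.K. Rowling</small></span>
  <div class="tags">
    <a class="tag" href="/tag/abilities/page/1/">abilities</a>
    <a class="tag" href="/tag/choices/page/1/">choices</a>
  </div>
</div>'''

quote = ET.fromstring(FRAGMENT)

item = {
    # roughly span.text::text
    'text': quote.findtext("span[@class='text']"),
    # roughly small.author::text
    'author': quote.findtext(".//small[@class='author']"),
    # roughly div.tags a.tag::text
    'tags': [a.text for a in quote.findall("div[@class='tags']/a[@class='tag']")],
}
print(item)
```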

Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into a Python dictionary:

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))

{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}

{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}

... a few more of these, omitted for brevity

>>>

Extracting data in our spider

Let’s get back to our spider. Until now, it doesn’t extract any data in particular, just saves the whole HTML page to a local file. Let’s integrate the extraction logic above into our spider.

A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield Python keyword in the callback, as you can see below:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
