Crawling
To put our spider to work, go to the project’s top level directory and run:
scrapy crawl dmoz
The crawl dmoz command runs the spider named dmoz (the name we gave it in its name attribute). You will get output similar to this:
2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2014-01-23 18:13:07-0400 [dmoz] INFO: Spider opened
2014-01-23 18:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-01-23 18:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-01-23 18:13:09-0400 [dmoz] INFO: Closing spider (finished)
Pay attention to the lines containing [dmoz], which correspond to our spider. You can see a log line for each URL
defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is shown at the end
of the log line, where it says (referer: None).
More interestingly, as our parse method instructs, two files have been created: Books and Resources, containing the
content of the two URLs.
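For reference, the parse method defined earlier in the tutorial looks roughly like this; it takes the second-to-last
segment of the URL as the filename (Books or Resources) and writes the response body to that file:

def parse(self, response):
    filename = response.url.split("/")[-2]
    with open(filename, 'wb') as f:
        f.write(response.body)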
What just happened under the hood?
Scrapy creates a scrapy.Request object for each URL in the start_urls attribute of the spider, and assigns the
spider's parse method as the callback function of each one.
These requests are scheduled and executed, and the resulting scrapy.http.Response objects are fed back to the
spider through the parse() method.
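In other words, the default behavior is roughly equivalent to defining a start_requests method like the following
sketch on the spider (the real implementation also marks these initial requests so they bypass duplicate filtering):

from scrapy.http import Request

def start_requests(self):
    # Roughly what Scrapy does by default: one Request per start URL,
    # with the spider's parse method as the callback.
    for url in self.start_urls:
        yield Request(url, callback=self.parse)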
Extracting Items
Introduction to Selectors
There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath or CSS expressions
called Scrapy Selectors. For more information about selectors and other extraction mechanisms see the Selectors
documentation.
Here are some examples of XPath expressions and their meanings:
• /html/head/title: selects the <title> element inside the <head> element of an HTML document
• /html/head/title/text(): selects the text inside the aforementioned <title> element
• //td: selects all the <td> elements
• //div[@class="mine"]: selects all <div> elements that have a class="mine" attribute
These are just a few simple examples of what you can do with XPath, but XPath expressions are far more
powerful. To learn more about XPath, we recommend this XPath tutorial.
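To make these expressions concrete, here is a minimal sketch that evaluates each of them against a small, made-up
HTML document, using the Selector class introduced below (expected results are shown as comments):

from scrapy.selector import Selector

html = """<html><head><title>Example</title></head>
<body>
  <table><tr><td>cell</td></tr></table>
  <div class="mine">hello</div>
</body></html>"""

sel = Selector(text=html)
sel.xpath('/html/head/title').extract()         # [u'<title>Example</title>']
sel.xpath('/html/head/title/text()').extract()  # [u'Example']
sel.xpath('//td').extract()                     # [u'<td>cell</td>']
sel.xpath('//div[@class="mine"]').extract()     # [u'<div class="mine">hello</div>']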
For working with XPaths, Scrapy provides the Selector class and convenient shortcuts, so you don't need to
instantiate selectors yourself every time you want to select something from a response.
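For instance, inside a spider callback the response object exposes an xpath() shortcut that builds the selector for
you. A minimal sketch (the spider name and URL here are just illustrative):

import scrapy

class TitleSpider(scrapy.Spider):
    name = "titles"
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

    def parse(self, response):
        # response.xpath(...) is shorthand for Selector(response).xpath(...)
        titles = response.xpath('//title/text()').extract()
        self.log("Page title: %s" % titles)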