Scrapy
Release 1.0.3
Scrapy developers
Contents
Getting help
First steps
2.1 Scrapy at a glance
2.2 Installation guide
2.3 Scrapy Tutorial
2.4 Examples
Basic concepts
3.1 Command line tool
3.2 Spiders
3.3 Selectors
3.4 Items
3.5 Item Loaders
3.6 Scrapy shell
3.7 Item Pipeline
3.8 Feed exports
3.9 Requests and Responses
3.10 Link Extractors
3.11 Settings
3.12 Exceptions
Built-in services
4.1 Logging
4.2 Stats Collection
4.3 Sending e-mail
4.4 Telnet Console
4.5 Web Service
Extending Scrapy
6.1 Architecture overview
6.2 Downloader Middleware
6.3 Spider Middleware
6.4 Extensions
6.5 Core API
6.6 Signals
6.7 Item Exporters
CHAPTER 1
Getting help
CHAPTER 2
First steps
2.1 Scrapy at a glance

Here is the code for a spider that scrapes the most upvoted questions from StackOverflow:

import scrapy


class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
Put this in a file, name it something like stackoverflow_spider.py, and run the spider using the
runspider command.
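For example, using the -o option to collect the scraped items into a JSON file:

scrapy runspider stackoverflow_spider.py -o top-stackoverflow-questions.json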
When this finishes, you will have a top-stackoverflow-questions.json file with a list of the most upvoted
questions on StackOverflow in JSON format, containing the title, link, number of upvotes, a list of the tags and the
question content in HTML, looking like this (reformatted for easier reading):
[{
Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and
XPath expressions, with helper methods to extract using regular expressions.
An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape data, very
useful when writing or debugging your spiders.
Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple
backends (FTP, S3, local filesystem)
Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations.
Strong extensibility support, allowing you to plug in your own functionality using signals and a well-defined
API (middlewares, extensions, and pipelines).
Wide range of built-in extensions and middlewares for handling:
cookies and session handling
HTTP features like compression, authentication, caching
user-agent spoofing
robots.txt
crawl depth restriction
and more
A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug
your crawler
Plus other goodies like reusable spiders to crawl sites from Sitemaps and XML/CSV feeds, a media pipeline
for automatically downloading images (or any other media) associated with the scraped items, a caching DNS
resolver, and much more!
2.2 Installation guide

Scrapy depends on OpenSSL, which comes preinstalled in all operating systems except Windows, where the Python installer
ships it bundled.
You can install Scrapy using pip (which is the canonical way to install Python packages).
To install using pip:
pip install Scrapy
On Windows, close the command prompt window and reopen it so the changes take effect, then run the following command and check
that it shows the expected Python version:
python --version
At this point Python 2.7 and the pip package manager should be working; let's install Scrapy:
pip install Scrapy
2.3 Scrapy Tutorial

The tutorial works inside a project named tutorial, created with the startproject command, which generates a directory tree like this:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
We begin by modeling the item that we will use to hold the sites' data obtained from dmoz.org. As we want to capture
the name, url and description of the sites, we define fields for each of these three attributes. To do that, we edit
items.py, found in the tutorial directory. Our Item class looks like this:
import scrapy


class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
This may seem complicated at first, but defining an item class allows you to use other handy components and helpers
within Scrapy.
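The Crawling section below runs a first version of the spider; a minimal version consistent with what that section expects (it simply saves each response body to Books.html and Resources.html) would look like this:

import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # save each downloaded page to a local file named after the last
        # non-empty segment of the URL (Books.html, Resources.html)
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

Save it as a file such as dmoz_spider.py under the tutorial/spiders directory.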
Crawling
To put our spider to work, go to the project's top-level directory and run:
scrapy crawl dmoz
This command runs the spider named dmoz that we've just added, which will send some requests to the dmoz.org
domain. You will get an output similar to this:
[ ... scrapy log lines crawling dmoz.org ... ]
Note: At the end you can see a log line for each URL defined in start_urls. Because these URLs are the starting
ones, they have no referrers, which is shown at the end of the log line, where it says (referer: None).
Now, check the files in the current directory. You should notice two new files have been created: Books.html and
Resources.html, with the content for the respective URLs, as our parse method instructs.
What just happened under the hood?
Scrapy creates scrapy.Request objects for each URL in the start_urls attribute of the Spider, and assigns
them the parse method of the spider as their callback function.
These Requests are scheduled, then executed, and scrapy.http.Response objects are returned and then fed
back to the spider, through the parse() method.
Extracting Items
Introduction to Selectors
There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath or CSS expressions
called Scrapy Selectors. For more information about selectors and other extraction mechanisms see the Selectors
documentation.
Here are some examples of XPath expressions and their meanings:
/html/head/title: selects the <title> element, inside the <head> element of an HTML document
/html/head/title/text(): selects the text inside the aforementioned <title> element.
//td: selects all the <td> elements
//div[@class="mine"]: selects all div elements which contain an attribute class="mine"
These are just a couple of simple examples of what you can do with XPath, but XPath expressions are indeed much
more powerful. To learn more about XPath, we recommend this tutorial to learn XPath through examples, and this
tutorial to learn how to think in XPath.
Note: CSS vs XPath: you can go a long way extracting data from web pages using only CSS selectors. However,
XPath offers more power because besides navigating the structure, it can also look at the content: you're able to select
things like: the link that contains the text 'Next Page'. Because of this, we encourage you to learn about XPath even if
you already know how to construct CSS selectors.
For working with CSS and XPath expressions, Scrapy provides the Selector class and convenient shortcuts to avoid
instantiating selectors yourself every time you need to select something from a response.
You can see selectors as objects that represent nodes in the document structure. So, the first instantiated selectors are
associated with the root node, or the entire document.
Selectors have four basic methods (click on the method to see the complete API documentation):
xpath(): returns a list of selectors, each of which represents the nodes selected by the xpath expression given
as argument.
css(): returns a list of selectors, each of which represents the nodes selected by the CSS expression given as
argument.
extract(): returns a unicode string with the selected data.
re(): returns a list of unicode strings extracted by applying the regular expression given as argument.
Trying Selectors in the Shell
To illustrate the use of Selectors were going to use the built-in Scrapy shell, which also requires IPython (an extended
Python console) installed on your system.
To start a shell, you must go to the projects top level directory and run:
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
Note: Remember to always enclose urls in quotes when running the Scrapy shell from the command line; otherwise urls
containing arguments (i.e. the & character) will not work.
This is what the shell looks like:
[ ... Scrapy log here ... ]
After the shell loads, you will have the response fetched in a local response variable, so if you type
response.body you will see the body of the response, or you can type response.headers to see its headers.
More importantly, response has a selector attribute, which is an instance of the Selector class instantiated
with this particular response. You can run queries on the response by calling response.selector.xpath() or
response.selector.css(). There are also some convenience shortcuts, response.xpath() and response.css(),
which map directly to response.selector.xpath() and response.selector.css().
So let's try it:
In [1]: response.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]
In [2]: response.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']
In [3]: response.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]
In [4]: response.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']
In [5]: response.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
Now, let's try to extract some real information from those pages.
You could type response.body in the console, and inspect the source code to figure out the XPaths you need to
use. However, inspecting the raw HTML code there could become a very tedious task. To make it easier, you can
use Firefox Developer Tools or some Firefox extensions like Firebug. For more information see Using Firebug for
scraping and Using Firefox for scraping.
After inspecting the page source, you'll find that the sites' information is inside a <ul> element, in fact the
second <ul> element.
So we can select each <li> element belonging to the sites list with this code:
response.xpath('//ul/li')
As we've said before, each .xpath() call returns a list of selectors, so we can concatenate further .xpath() calls
to dig deeper into a node. We are going to use that property here, so:
for sel in response.xpath('//ul/li'):
    title = sel.xpath('a/text()').extract()
    link = sel.xpath('a/@href').extract()
    desc = sel.xpath('text()').extract()
    print title, link, desc
Note: For a more detailed description of using nested selectors, see Nesting selectors and Working with relative
XPaths in the Selectors documentation.
Now try crawling dmoz.org again and you'll see sites being printed in your output. Run:
scrapy crawl dmoz
So, in order to return the data we've scraped so far, the final code for our Spider would be like this:
import scrapy

from tutorial.items import DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
Note:
You can find a fully-functional variant of this spider in the dirbot project available at
https://github.com/scrapy/dirbot
Now crawling dmoz.org yields DmozItem objects.
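The tutorial then changes the spider so that parse() only collects the category links and delegates the actual scraping to a second callback. A sketch consistent with the description below (the CSS selector used for the category links is an assumption):

import scrapy

from tutorial.items import DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        # extract the links to the category pages and schedule a request for each
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # scrape the actual data from each category page
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item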
Now the parse() method only extracts the interesting links from the page, builds a full absolute URL using the
response.urljoin method (since the links can be relative) and yields new requests to be sent later, registering the
parse_dir_contents() method as the callback that will ultimately scrape the data we want.
What you see here is Scrapy's mechanism of following links: when you yield a Request in a callback method,
Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.
Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds
of data depending on the page it's visiting.
A common pattern is a callback method that extracts some items, looks for a link to follow to the next page and then
yields a Request with the same callback for it:
def parse_articles_follow_next_page(self, response):
    for article in response.xpath("//article"):
        item = ArticleItem()
        # ... extract article data here ...
        yield item

    next_page = response.css("ul.navigation > li.next-page > a::attr('href')")
    if next_page:
        url = response.urljoin(next_page[0].extract())
        yield scrapy.Request(url, self.parse_articles_follow_next_page)
This creates a sort of loop, following all the links to the next page until none is found, which is handy for crawling blogs,
forums and other sites with pagination.
Another common pattern is to build an item with data from more than one page, using a trick to pass additional data
to the callbacks.
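One way to do this is to carry the partially built item along in Request.meta; a sketch, where MyItem and its fields are illustrative:

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    # stash the half-built item so the next callback can finish it
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item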
Note: As an example spider that leverages this mechanism, check out the CrawlSpider class for a generic spider
that implements a small rules engine that you can use to write your crawlers on top of it.
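To store the scraped items, the simplest approach is to use the Feed exports, passing an output file to the crawl command, for example:

scrapy crawl dmoz -o items.json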
That will generate an items.json file containing all scraped items, serialized in JSON.
In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex
things with the scraped items, you can write an Item Pipeline. As with Items, a placeholder file for Item Pipelines
has been set up for you when the project is created, in tutorial/pipelines.py. You don't need to
implement any item pipelines, though, if you just want to store the scraped items.
2.4 Examples
The best way to learn is with examples, and Scrapy is no exception. For this reason, there is an example Scrapy project
named dirbot, that you can use to play and learn more about Scrapy. It contains the dmoz spider described in the
tutorial.
This dirbot project is available at: https://github.com/scrapy/dirbot
It contains a README file with a detailed description of the project contents.
If you're familiar with git, you can check out the code. Otherwise you can download a tarball or zip file of the project
by clicking on Downloads.
The scrapy tag on Snipplr is used for sharing code snippets such as spiders, middlewares, extensions, or scripts. Feel
free (and encouraged!) to share any code there.
Scrapy at a glance Understand what Scrapy is and how it can help you.
Installation guide Get Scrapy installed on your computer.
Scrapy Tutorial Write your first Scrapy project.
Examples Learn more by playing with a pre-made Scrapy project.
CHAPTER 3
Basic concepts
3.1 Command line tool

Scrapy is controlled through the scrapy command-line tool. Scrapy projects share a default directory structure, similar to this:
scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
The directory where the scrapy.cfg file resides is known as the project root directory. That file contains the name
of the python module that defines the project settings. Here is an example:
[settings]
default = myproject.settings
When the scrapy tool is run with no arguments, the first line of its output will print the currently active project if you're inside a Scrapy project. In this example it was run from
outside a project. If run from inside a project it would have printed something like this:
Scrapy X.Y - project: myproject
Usage:
scrapy <command> [options] [args]
[...]
Creating projects
The first thing you typically do with the scrapy tool is create your Scrapy project:
scrapy startproject myproject
And you're ready to use the scrapy command to manage and control your project from there.
Controlling projects
You use the scrapy tool from inside your projects to control and manage them.
For example, to create a new spider:
scrapy genspider mydomain mydomain.com
Some Scrapy commands (like crawl) must be run from inside a Scrapy project. See the commands reference below
for more information on which commands must be run from inside projects and which ones don't need to be.
Also keep in mind that some commands may have slightly different behaviours when running them from inside
projects. For example, the fetch command will use spider-overridden behaviours (such as the user_agent attribute
to override the user-agent) if the url being fetched is associated with some specific spider. This is intentional, as the
fetch command is meant to be used to check how spiders are downloading pages.
There are two kinds of commands: those that only work from inside a Scrapy project (project-specific commands) and
those that also work without an active Scrapy project (global commands), though they may behave slightly differently
when run from inside a project (as they would use the project's overridden settings).
Global commands:
startproject
settings
runspider
shell
fetch
view
version
Project-only commands:
crawl
check
list
edit
parse
genspider
bench
startproject
Syntax: scrapy startproject <project_name>
Requires project: no
Creates a new Scrapy project named project_name, under the project_name directory.
Usage example:
$ scrapy startproject myproject
genspider
Syntax: scrapy genspider [-t template] <name> <domain>
Requires project: yes
Create a new spider in the current project.
This is just a convenience shortcut command for creating spiders based on pre-defined templates, but certainly not the
only way to create spiders. You can just create the spider source code files yourself, instead of using this command.
Usage example:
$ scrapy genspider -l
Available templates:
basic
crawl
csvfeed
xmlfeed
$ scrapy genspider -d basic
import scrapy


class $classname(scrapy.Spider):
    name = "$name"
    allowed_domains = ["$domain"]
    start_urls = (
        'http://www.$domain/',
    )

    def parse(self, response):
        pass
$ scrapy genspider -t basic example example.com
Created spider 'example' using template 'basic' in module:
mybot.spiders.example
crawl
Syntax: scrapy crawl <spider>
Requires project: yes
Start crawling using a spider.
Usage examples:
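For a spider named myspider in your project:

$ scrapy crawl myspider
[ ... myspider starts crawling ... ]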
check
Syntax: scrapy check [-l] <spider>
Requires project: yes
Run contract checks.
Usage examples:
$ scrapy check -l
first_spider
* parse
* parse_item
second_spider
* parse
* parse_item
$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing
[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4
list
Syntax: scrapy list
Requires project: yes
List all available spiders in the current project. The output is one spider per line.
Usage example:
$ scrapy list
spider1
spider2
edit
Syntax: scrapy edit <spider>
Requires project: yes
Edit the given spider using the editor defined in the EDITOR setting.
This command is provided only as a convenience shortcut for the most common case; the developer is of course free
to choose any tool or IDE to write and debug spiders.
Usage example:
$ scrapy edit spider1
fetch
Syntax: scrapy fetch <url>
Requires project: no
Downloads the given URL using the Scrapy downloader and writes the contents to standard output.
The interesting thing about this command is that it fetches the page the way the spider would download it. For example,
if the spider has a USER_AGENT attribute which overrides the User Agent, it will use that one.
So this command can be used to see how your spider would fetch a certain page.
If used outside a project, no particular per-spider behaviour would be applied and it will just use the default Scrapy
downloader settings.
Usage examples:
$ scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]
$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
'Age': ['1263'],
'Connection': ['close'],
'Content-Length': ['596'],
'Content-Type': ['text/html; charset=UTF-8'],
'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
'Etag': ['"573c1-254-48c9c87349680"'],
'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
'Server': ['Apache/2.2.3 (CentOS)']}
view
Syntax: scrapy view <url>
Requires project: no
Opens the given URL in a browser, as your Scrapy spider would see it. Sometimes spiders see pages differently
from regular users, so this can be used to check what the spider sees and confirm it's what you expect.
Usage example:
$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]
shell
Syntax: scrapy shell [url]
Requires project: no
Starts the Scrapy shell for the given URL (if given) or empty if no URL is given. See Scrapy shell for more info.
Usage example:
$ scrapy shell http://www.example.com/some/page.html
[ ... scrapy shell starts ... ]
parse
Syntax: scrapy parse <url> [options]
Requires project: yes
Fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback
option, or parse if not given.
Supported options:
--spider=SPIDER: bypass spider autodetection and force use of specific spider
--a NAME=VALUE: set spider argument (may be repeated)
--callback or -c: spider method to use as callback for parsing the response
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the
response
--noitems: don't show scraped items
--nolinks: don't show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level for which the requests should be followed recursively (default: 1)
--verbose or -v: display information for each depth level
Usage example:
$ scrapy parse http://www.example.com/ -c parse_item
[ ... scrapy log lines crawling example.com spider ... ]
>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'name': u'Example item',
  'category': u'Furniture',
  'length': u'12 cm'}]

# Requests  ------------------------------------------------------------------
[]
settings
Syntax: scrapy settings [options]
Requires project: no
Get the value of a Scrapy setting.
If used inside a project it'll show the project setting value, otherwise it'll show the default Scrapy value for that setting.
Example usage:
$ scrapy settings --get BOT_NAME
scrapybot
$ scrapy settings --get DOWNLOAD_DELAY
0
runspider
Syntax: scrapy runspider <spider_file.py>
Requires project: no
Run a spider self-contained in a Python file, without having to create a project.
Example usage:
$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]
version
Syntax: scrapy version [-v]
Requires project: no
Prints the Scrapy version. If used with -v it also prints Python, Twisted and Platform info, which is useful for bug
reports.
bench
New in version 0.17.
Syntax: scrapy bench
Requires project: no
Run a quick benchmark test. See Benchmarking.
3.2 Spiders
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform
the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words,
Spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or,
in some cases, a group of sites).
For spiders, the scraping cycle goes through something like this:
1. You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called
with the response downloaded from those requests.
The first requests to perform are obtained by calling the start_requests() method, which (by default)
generates a Request for each URL specified in start_urls, with the parse method as the callback function
for those Requests.
2. In the callback function, you parse the response (web page) and return either dicts with extracted data, Item
objects, Request objects, or an iterable of these objects. Those Requests will also contain a callback (maybe
the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.
3. In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup,
lxml or whatever mechanism you prefer) and generate items with the parsed data.
4. Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or
written to a file using Feed exports.
Even though this cycle applies (more or less) to any kind of spider, there are different kinds of default spiders bundled
into Scrapy for different purposes. We will talk about those types here.
3.2.1 scrapy.Spider
class scrapy.spiders.Spider
This is the simplest spider, and the one from which every other spider must inherit (including spiders that come
bundled with Scrapy, as well as spiders that you write yourself). It doesn't provide any special functionality. It
just provides a default start_requests() implementation which sends requests from the start_urls
spider attribute and calls the spider's method parse for each of the resulting responses.
name
A string which defines the name for this spider. The spider name is how the spider is located (and instantiated) by Scrapy, so it must be unique. However, nothing prevents you from instantiating more than one
instance of the same spider. This is the most important spider attribute and it's required.
If the spider scrapes a single domain, a common practice is to name the spider after the domain, with
or without the TLD. So, for example, a spider that crawls mywebsite.com would often be called
mywebsite.
allowed_domains
An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs
not belonging to the domain names specified in this list won't be followed if OffsiteMiddleware is
enabled.
start_urls
A list of URLs where the spider will begin to crawl from, when no particular URLs are specified. So,
the first pages downloaded will be those listed here. The subsequent URLs will be generated successively
from data contained in the start URLs.
custom_settings
A dictionary of settings that will be overridden from the project wide configuration when running this
spider. It must be defined as a class attribute since the settings are updated before instantiation.
For a list of available built-in settings see: Built-in settings reference.
crawler
This attribute is set by the from_crawler() class method after initializing the class, and links to the
Crawler object to which this spider instance is bound.
Crawlers encapsulate a lot of components in the project for their single entry access (such as extensions,
middlewares, signals managers, etc). See Crawler API to know more about them.
settings
Configuration with which this spider is being run. This is a Settings instance; see the Settings topic for a
detailed introduction on this subject.
logger
Python logger created with the Spider's name. You can use it to send log messages, as described
in Logging from Spiders.
from_crawler(crawler, *args, **kwargs)
This is the class method used by Scrapy to create your spiders.
You probably won't need to override this directly, since the default implementation acts as a proxy to the
__init__() method, calling it with the given arguments args and named arguments kwargs.
Nonetheless, this method sets the crawler and settings attributes in the new instance, so they can be
accessed later inside the spiders code.
Parameters
crawler (Crawler instance) crawler to which the spider will be bound
args (list) arguments passed to the __init__() method
kwargs (dict) keyword arguments passed to the __init__() method
start_requests()
This method must return an iterable with the first Requests to crawl for this spider.
This is the method called by Scrapy when the spider is opened for scraping when no particular URLs
are specified. If particular URLs are specified, make_requests_from_url() is used instead to
create the Requests. This method is also called only once from Scrapy, so it's safe to implement it as a
generator.
The default implementation uses make_requests_from_url() to generate Requests for each url in
start_urls.
If you want to change the Requests used to start scraping a domain, this is the method to override. For
example, if you need to start by logging in using a POST request, you could do:
class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass
make_requests_from_url(url)
A method that receives a URL and returns a Request object (or a list of Request objects) to scrape.
This method is used to construct the initial requests in the start_requests() method, and is typically
used to convert urls to requests.
Unless overridden, this method returns Requests with the parse() method as their callback function,
and with dont_filter parameter enabled (see Request class for more info).
parse(response)
This is the default callback used by Scrapy to process downloaded responses, when their requests don't
specify a callback.
The parse method is in charge of processing the response and returning scraped data and/or more URLs
to follow. Other Requests callbacks have the same requirements as the Spider class.
This method, as well as any other Request callback, must return an iterable of Request and/or dicts or
Item objects.
Parameters response (Response) the response to parse
log(message[, level, component ])
Wrapper that sends a log message through the Spider's logger, kept for backwards compatibility. For
more information see Logging from Spiders.
closed(reason)
Called when the spider closes. This method provides a shortcut to signals.connect() for the
spider_closed signal.

Let's see an example:
import scrapy


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
Return multiple Requests and items from a single callback:

import scrapy


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
Instead of start_urls you can use start_requests() directly; to give data more structure you can use Items:
import scrapy
from myproject.items import MyItem


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
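Spiders can receive arguments from the command line through the crawl command's -a option (the same spider-argument option listed for the parse command above). A sketch of a spider that builds its start URL from such an argument; the category argument and the URL pattern are illustrative:

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # the value passed with -a category=... arrives here as a keyword argument
        self.start_urls = ['http://www.example.com/categories/%s' % category]

It would then be run as, for example: scrapy crawl myspider -a category=electronics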
Spider arguments can also be passed through the Scrapyd schedule.json API. See Scrapyd documentation.
CrawlSpider
class scrapy.spiders.CrawlSpider
This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for
following links by defining a set of rules. It may not be the best suited for your particular web sites or project,
but it's generic enough for several cases, so you can start from it and override it as needed for more custom
functionality, or just implement your own spider.
Apart from the attributes inherited from Spider (that you must specify), this class supports a new attribute:
rules
Which is a list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the
site. Rule objects are described below. If multiple rules match the same link, the first one will be used,
according to the order they're defined in this attribute.
This spider also exposes an overrideable method:
parse_start_url(response)
This method is called for the start_urls responses. It allows you to parse the initial responses and must return
either an Item object, a Request object, or an iterable containing any of them.
Crawling rules
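Crawling rules are defined with the Rule class; its constructor takes roughly the following arguments, which match the descriptions below (link_extractor is a Link Extractor object defining how links are extracted from each crawled page, and callback is a callable or a string naming a spider method to be called for each extracted link):

class scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)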
Warning: When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses
the parse method itself to implement its logic. So if you override the parse method, the crawl spider will
no longer work.
cb_kwargs is a dict containing the keyword arguments to be passed to the callback function.
follow is a boolean which specifies if links should be followed from each response extracted with this rule. If
callback is None follow defaults to True, otherwise it defaults to False.
process_links is a callable, or a string (in which case a method from the spider object with that name
will be used) which will be called for each list of links extracted from each response using the specified
link_extractor. This is mainly used for filtering purposes.
process_request is a callable, or a string (in which case a method from the spider object with that name
will be used) which will be called with every request extracted by this rule, and must return a request or None
(to filter out the request).
CrawlSpider example
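The example spider is sketched below, consistent with the description that follows; the URL patterns and the scraped fields are illustrative assumptions:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract and follow links matching 'category.php'
        # (no callback, so follow defaults to True)
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = {}
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item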
This spider would start crawling example.com's home page, collecting category links and item links, parsing the latter
with the parse_item method. For each item response, some data will be extracted from the HTML using XPath,
and an Item will be filled with it.
XMLFeedSpider
class scrapy.spiders.XMLFeedSpider
XMLFeedSpider is designed for parsing XML feeds by iterating through them by a certain node name. The
iterator can be chosen from: iternodes, xml, and html. It's recommended to use the iternodes iterator
for performance reasons, since the xml and html iterators generate the whole DOM at once in order to parse
it. However, using html as the iterator may be useful when parsing XML with bad markup.
To set the iterator and the tag name, you must define the following class attributes:
iterator
A string which defines the iterator to use. It can be either:
iternodes - a fast iterator based on regular expressions
html - an iterator which uses Selector. Keep in mind this uses DOM parsing and must load
all DOM in memory which could be a problem for big feeds
xml - an iterator which uses Selector. Keep in mind this uses DOM parsing and must load all
DOM in memory which could be a problem for big feeds
It defaults to: iternodes.
itertag
A string with the name of the node (or element) to iterate in. Example:
itertag = 'product'
namespaces
A list of (prefix, uri) tuples which define the namespaces available in that document that will be
processed with this spider. The prefix and uri will be used to automatically register namespaces using
the register_namespace() method.
You can then specify nodes with namespaces in the itertag attribute.
Example:
class YourSpider(XMLFeedSpider):

    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'

    # ...
Apart from these new attributes, this spider has the following overrideable methods too:
adapt_response(response)
A method that receives the response as soon as it arrives from the spider middleware, before the spider
starts parsing it. It can be used to modify the response body before parsing it. This method receives a
response and also returns a response (it could be the same or another one).
parse_node(response, selector)
This method is called for the nodes matching the provided tag name (itertag). Receives the response
and a Selector for each node. Overriding this method is mandatory; otherwise, your spider won't
work. This method must return either an Item object, a Request object, or an iterable containing any of
them.
process_results(response, results)
This method is called for each result (item or request) returned by the spider, and it's intended to perform
any last-minute processing required before returning the results to the framework core, for example setting
the item IDs. It receives a list of results and the response which originated those results. It must return a
list of results (Items or Requests).
XMLFeedSpider example
These spiders are pretty easy to use; let's have a look at one example:
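A sketch of such a spider, consistent with the description that follows (TestItem and its fields mirror the CSVFeedSpider example below and are assumptions):

from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem


class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # the default, shown here for clarity
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s',
                         self.itertag, ''.join(node.extract()))

        item = TestItem()
        item['id'] = node.xpath('@id').extract()
        item['name'] = node.xpath('name').extract()
        item['description'] = node.xpath('description').extract()
        return item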
Basically what we did up there was to create a spider that downloads a feed from the given start_urls, and then
iterates through each of its item tags, prints them out, and stores some random data in an Item.
CSVFeedSpider
class scrapy.spiders.CSVFeedSpider
This spider is very similar to the XMLFeedSpider, except that it iterates over rows, instead of nodes. The method
that gets called in each iteration is parse_row().
delimiter
A string with the separator character for each field in the CSV file. Defaults to ',' (comma).
quotechar
A string with the enclosure character for each field in the CSV file. Defaults to '"' (quotation mark).
headers
A list of the column names in the CSV file, which will be used to extract fields from it.
parse_row(response, row)
Receives a response and a dict (representing each row) with a key for each provided (or detected)
header of the CSV file. This spider also gives the opportunity to override adapt_response and
process_results methods for pre- and post-processing purposes.
CSVFeedSpider example
Let's see an example similar to the previous one, but using a CSVFeedSpider:
from scrapy.spiders import CSVFeedSpider
from myproject.items import TestItem


class MySpider(CSVFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.csv']
    delimiter = ';'
    quotechar = "'"
    headers = ['id', 'name', 'description']

    def parse_row(self, response, row):
        self.logger.info('Hi, this is a row!: %r', row)

        item = TestItem()
        item['id'] = row['id']
        item['name'] = row['name']
        item['description'] = row['description']
        return item
SitemapSpider
class scrapy.spiders.SitemapSpider
SitemapSpider allows you to crawl a site by discovering the URLs using Sitemaps.
It supports nested sitemaps and discovering sitemap urls from robots.txt.
sitemap_urls
A list of urls pointing to the sitemaps whose urls you want to crawl.
You can also point to a robots.txt and it will be parsed to extract sitemap urls from it.
sitemap_rules
A list of tuples (regex, callback) where:
regex is a regular expression to match urls extracted from sitemaps. regex can be either a str or a
compiled regex object.
callback is the callback to use for processing the urls that match the regular expression. callback
can be a string (indicating the name of a spider method) or a callable.
For example:
sitemap_rules = [('/product/', 'parse_product')]
Rules are applied in order, and only the first one that matches will be used.
If you omit this attribute, all urls found in sitemaps will be processed with the parse callback.
sitemap_follow
A list of regexes of sitemap URLs that should be followed. This is only for sites that use Sitemap index files
that point to other sitemap files.
By default, all sitemaps are followed.
sitemap_alternate_links
Specifies if alternate links for one url should be followed. These are links for the same website in another
language passed within the same url block.
For example:
<url>
<loc>http://example.com/</loc>
<xhtml:link rel="alternate" hreflang="de" href="http://example.com/de"/>
</url>
With sitemap_alternate_links set, both URLs would be retrieved; with it disabled, only http://example.com/ would be retrieved. The default is sitemap_alternate_links disabled.
SitemapSpider examples
Simplest example: process all urls discovered through sitemaps using the parse callback:
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        pass  # ... scrape item here ...
Process some urls with certain callback and other urls with a different callback:
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    sitemap_rules = [
        ('/product/', 'parse_product'),
        ('/category/', 'parse_category'),
    ]

    def parse_product(self, response):
        pass  # ... scrape product ...

    def parse_category(self, response):
        pass  # ... scrape category ...
Follow sitemaps defined in the robots.txt file and only follow sitemaps whose url contains /sitemap_shop:
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    sitemap_follow = ['/sitemap_shops']

    def parse_shop(self, response):
        pass  # ... scrape shop here ...
3.3 Selectors
When you're scraping web pages, the most common task you need to perform is to extract data from the HTML source.
There are several libraries available to achieve this:
BeautifulSoup is a very popular web scraping library among Python programmers which constructs a Python
object based on the structure of the HTML code and also deals with bad markup reasonably well, but it has one
drawback: it's slow.
lxml is an XML parsing library (which also parses HTML) with a pythonic API based on ElementTree. (lxml is
not part of the Python standard library.)
Scrapy comes with its own mechanism for extracting data. They're called selectors because they select certain parts
of the HTML document specified either by XPath or CSS expressions.
XPath is a language for selecting nodes in XML documents, which can also be used with HTML. CSS is a language
for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.
Scrapy selectors are built over the lxml library, which means they're very similar in speed and parsing accuracy.
This page explains how selectors work and describes their API which is very small and simple, unlike the lxml API
which is much bigger because the lxml library can be used for many other tasks, besides selecting markup documents.
For a complete reference of the selectors API see the Selector reference.
For convenience, response objects expose a selector via the .selector attribute; it's totally OK to use this shortcut when
possible:
>>> response.selector.xpath('//span/text()').extract()
[u'good']
Using selectors
To explain how to use the selectors we'll use the Scrapy shell (which provides interactive testing) and an example page
located on the Scrapy documentation server:
http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
Here's its HTML code:
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>
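First, open the shell on that page:

scrapy shell "http://doc.scrapy.org/en/latest/_static/selectors-sample1.html"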
Then, after the shell loads, you'll have the response available as the response shell variable, and its attached selector in
the response.selector attribute.
Since we're dealing with HTML, the selector will automatically use an HTML parser.
So, by looking at the HTML code of that page, let's construct an XPath for selecting the text inside the title tag:
>>> response.selector.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
Querying responses using XPath and CSS is so common that responses include two convenience shortcuts:
response.xpath() and response.css():
>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]
As you can see, .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors.
This API can be used for quickly selecting nested data:
>>> response.css('img').xpath('@src').extract()
[u'image1_thumb.jpg',
u'image2_thumb.jpg',
u'image3_thumb.jpg',
u'image4_thumb.jpg',
u'image5_thumb.jpg']
To actually extract the textual data, you must call the selector .extract() method, as follows:
>>> response.xpath('//title/text()').extract()
[u'Example website']
If you want to extract only the first matched element, you can call the selector's .extract_first():
>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
u'Name: My image 1 '
Notice that CSS selectors can select text or attribute nodes using CSS3 pseudo-elements:
>>> response.css('title::text').extract()
[u'Example website']
Now we're going to get the base URL and some image links:
>>> response.xpath('//base/@href').extract()
[u'http://example.com/']
>>> response.css('base::attr(href)').extract()
[u'http://example.com/']
>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
[u'image1.html',
u'image2.html',
u'image3.html',
u'image4.html',
u'image5.html']
>>> response.css('a[href*=image]::attr(href)').extract()
[u'image1.html',
u'image2.html',
u'image3.html',
u'image4.html',
u'image5.html']
>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
[u'image1_thumb.jpg',
u'image2_thumb.jpg',
u'image3_thumb.jpg',
u'image4_thumb.jpg',
u'image5_thumb.jpg']
>>> response.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg',
u'image2_thumb.jpg',
u'image3_thumb.jpg',
u'image4_thumb.jpg',
u'image5_thumb.jpg']
Nesting selectors
The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection
methods for those selectors too. Here's an example:
>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
[u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print 'Link number %d points to url %s and image %s' % args

Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']
There's an additional helper reciprocating .extract_first() for .re(), named .re_first(). Use it to
extract just the first matching string:
>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
u'My image 1'
At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements
from the document, not only those inside <div> elements:
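For example, assuming divs = response.xpath('//div'):

>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print p.extract()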
This is the proper way to do it (note the dot prefixing the .//p XPath):
>>> for p in divs.xpath('.//p'):
...     print p.extract()
For more details about relative XPaths see the Location Paths section in the XPath specification.
Using EXSLT extensions
Being built atop lxml, Scrapy selectors also support some EXSLT extensions and come with these pre-registered
namespaces to use in XPath expressions:
prefix | namespace                            | usage
re     | http://exslt.org/regular-expressions | regular expressions
set    | http://exslt.org/sets                 | set manipulation
Regular expressions
The test() function, for example, can prove quite useful when XPath's starts-with() or contains() are
not sufficient.
Example selecting links in list item with a class attribute ending with a digit:
>>> from scrapy import Selector
>>> doc = """
... <div>
...     <ul>
...         <li class="item-0"><a href="link1.html">first item</a></li>
...         <li class="item-1"><a href="link2.html">second item</a></li>
...         <li class="item-inactive"><a href="link3.html">third item</a></li>
...         <li class="item-1"><a href="link4.html">fourth item</a></li>
...         <li class="item-0"><a href="link5.html">fifth item</a></li>
...     </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath('//li//@href').extract()
[u'link1.html', u'link2.html', u'link3.html', u'link4.html', u'link5.html']
>>> sel.xpath('//li[re:test(@class, "item-\d$")]//@href').extract()
[u'link1.html', u'link2.html', u'link4.html', u'link5.html']
>>>
Warning: The C library libxslt doesn't natively support EXSLT regular expressions, so lxml's implementation
uses hooks to Python's re module. Thus, using regexp functions in your XPath expressions may add a small
performance penalty.
Set operations
These can be handy for excluding parts of a document tree before extracting text elements for example.
Example extracting microdata (sample content taken from http://schema.org/Product) with groups of itemscopes and
corresponding itemprops:
doc = """
<div itemscope itemtype="http://schema.org/Product">
<span itemprop="name">Kenmore White 17" Microwave</span>
<img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' />
<div itemprop="aggregateRating"
itemscope itemtype="http://schema.org/AggregateRating">
Rated <span itemprop="ratingValue">3.5</span>/5
based on <span itemprop="reviewCount">11</span> customer reviews
</div>
<div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<span itemprop="price">$55.00</span>
<link itemprop="availability" href="http://schema.org/InStock" />In stock
</div>
Product description:
<span itemprop="description">0.7 cubic feet countertop microwave.
Has six preset cooking categories and convenience features like
Add-A-Minute and Child Lock.</span>
Customer reviews:
<div itemprop="review" itemscope itemtype="http://schema.org/Review">
<span itemprop="name">Not a happy camper</span> by <span itemprop="author">Ellie</span>,
<meta itemprop="datePublished" content="2011-04-01">April 1, 2011
<div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
<meta itemprop="worstRating" content = "1">
<span itemprop="ratingValue">1</span>/
<span itemprop="bestRating">5</span>stars
</div>
<span itemprop="description">The lamp burned out and now I have to replace
it. </span>
</div>
<div itemprop="review" itemscope itemtype="http://schema.org/Review">
<span itemprop="name">Value purchase</span> by <span itemprop="author">Lucas</span>,
<meta itemprop="datePublished" content="2011-03-25">March 25, 2011
<div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
<meta itemprop="worstRating" content = "1"/>
<span itemprop="ratingValue">4</span>/
<span itemprop="bestRating">5</span>stars
</div>
<span itemprop="description">Great microwave for the price. It is small and
fits in my apartment.</span>
</div>
...
</div>
"""
sel = Selector(text=doc, type="html")
for scope in sel.xpath('//div[@itemscope]'):
    print "current scope:", scope.xpath('@itemtype').extract()
    props = scope.xpath('''
                set:difference(./descendant::*/@itemprop,
                               .//*[@itemscope]/*/@itemprop)''')
    print "    properties:", props.extract()
    print
Here we first iterate over itemscope elements, and for each one, we look for all itemprops elements and exclude
those that are themselves inside another itemscope.
Some XPath tips
Here are some tips that you may find useful when using XPath with Scrapy selectors, based on this post from ScrapingHub's blog. If you are not yet familiar with XPath, you may want to take a look first at this XPath tutorial.
Using text nodes in a condition
When you need to use the text content as argument to an XPath string function, avoid using .//text() and use just
. instead.
This is because the expression .//text() yields a collection of text elements (a node-set). And when a node-set is converted to a string, which happens when it is passed as argument to a string function like contains() or
starts-with(), it results in the text for the first element only.
Example:
>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
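So, converting the .//text() node-set to a string (a reconstructed snippet; the output values follow from the HTML above):
>>> sel.xpath('//a//text()').extract() # take a peek at the node-set
[u'Click here to go to the ', u'Next Page']
>>> sel.xpath("string(//a//text())").extract() # convert it to string
[u'Click here to go to the ']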
A node converted to a string, however, puts together the text of itself plus of all its descendants:
>>> sel.xpath("//a[1]").extract() # select the first node
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").extract() # convert it to string
[u'Click here to go to the Next Page']
So, using the .//text() node-set won't select anything in this case:
>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").extract()
[]
Beware of the difference between //node[1] and (//node)[1]
//node[1] selects all the nodes occurring first under their respective parents.
(//node)[1] selects all the nodes in the document, and then gets only the first of them.
Example:
>>> from scrapy import Selector
>>> sel = Selector(text="""
....:     <ul class="list">
....:         <li>1</li>
....:         <li>2</li>
....:         <li>3</li>
....:     </ul>
....:     <ul class="list">
....:         <li>4</li>
....:         <li>5</li>
....:         <li>6</li>
....:     </ul>""")
>>> xp = lambda x: sel.xpath(x).extract()
This gets all first <li> elements under whatever their parent is:
>>> xp("//li[1]")
[u'<li>1</li>', u'<li>4</li>']
And this gets the first <li> element in the whole document:
>>> xp("(//li)[1]")
[u'<li>1</li>']
And this gets the first <li> element under an <ul> parent in the whole document:
>>> xp("(//ul/li)[1]")
[u'<li>1</li>']
When querying by class, consider using CSS
Because an element can contain multiple CSS classes, the XPath way to select elements by class is rather verbose:
*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]
If you use @class='someclass' you may end up missing elements that have other classes, and if you just use
contains(@class, 'someclass') to make up for that you may end up with more elements than you want, if
they have a different class name that shares the string someclass.
As it turns out, Scrapy selectors allow you to chain selectors, so most of the time you can just select by class using
CSS and then switch to XPath when needed:
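For instance, a minimal sketch of such chaining (the HTML snippet here is illustrative):
>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').extract()
[u'2014-07-23 19:00']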
This is cleaner than using the verbose XPath trick shown above. Just remember to use the "." in the XPath expressions
that follow.
Built-in Selectors reference
class scrapy.selector.Selector(response=None, text=None, type=None)
An instance of Selector is a wrapper over the response to select certain parts of its content.
xpath(query)
Find nodes matching the xpath query and return the result as a SelectorList instance with all elements flattened.
query is a string containing the XPath query to apply.
Note: For convenience this method can be called as response.xpath()
css(query)
Apply the given CSS selector and return a SelectorList instance.
query is a string containing the CSS selector to apply.
In the background, CSS queries are translated into XPath queries using the cssselect library and run using the
.xpath() method.
Note: For convenience this method can be called as response.css()
extract()
Serialize and return the matched nodes as a list of unicode strings. Percent encoded content is unquoted.
re(regex)
Apply the given regex and return a list of unicode strings with the matches.
regex can be either a compiled regular expression or a string which will be compiled to a regular expression using re.compile(regex)
register_namespace(prefix, uri)
Register the given namespace to be used in this Selector. Without registering namespaces you can't
select or extract data from non-standard namespaces. See examples below.
remove_namespaces()
Remove all namespaces, allowing you to traverse the document using namespace-less XPaths. See example
below.
__nonzero__()
Returns True if there is any real content selected or False otherwise. In other words, the boolean value
of a Selector is given by the contents it selects.
SelectorList objects
class scrapy.selector.SelectorList
The SelectorList class is a subclass of the builtin list class, which provides a few additional methods.
xpath(query)
Call the .xpath() method for each element in this list and return their results flattened as another
SelectorList.
query is the same argument as the one in Selector.xpath()
css(query)
Call the .css() method for each element in this list and return their results flattened as another
SelectorList.
query is the same argument as the one in Selector.css()
extract()
Call the .extract() method for each element in this list and return their results flattened, as a list of
unicode strings.
re(regex)
Call the .re() method for each element in this list and return their results flattened, as a list of unicode
strings.
__nonzero__()
Returns True if the list is not empty, False otherwise.
Here's a couple of Selector examples to illustrate several concepts. In all cases, we assume there is already a
Selector instantiated with an HtmlResponse object like this:
sel = Selector(html_response)
1. Select all <h1> elements from an HTML response body, returning a list of Selector objects (ie. a
SelectorList object):
sel.xpath("//h1")
2. Extract the text of all <h1> elements from an HTML response body, returning a list of unicode strings:
sel.xpath("//h1").extract()
sel.xpath("//h1/text()").extract()
3. Iterate over all <p> tags and print their class attribute:
for node in sel.xpath("//p"):
    print node.xpath("@class").extract()
Here's a couple of examples to illustrate several concepts. In both cases we assume there is already a Selector
instantiated with an XmlResponse object like this:
sel = Selector(xml_response)
1. Select all <product> elements from an XML response body, returning a list of Selector objects (ie. a
SelectorList object):
sel.xpath("//product")
2. Extract all prices from a Google Base XML feed which requires registering a namespace:
sel.register_namespace("g", "http://base.google.com/ns/1.0")
sel.xpath("//g:price").extract()
Removing namespaces
When dealing with scraping projects, it is often quite convenient to get rid of namespaces altogether and just work with
element names, to write more simple/convenient XPaths. You can use the Selector.remove_namespaces()
method for that.
Let's show an example that illustrates this with the GitHub blog Atom feed.
First, we open the shell with the url we want to scrape:
$ scrapy shell https://github.com/blog.atom
Once in the shell we can try selecting all <link> objects and see that it doesn't work (because the Atom XML
namespace is obfuscating those nodes):
>>> response.xpath("//link")
[]
But once we call the Selector.remove_namespaces() method, all nodes can be accessed directly by their
names:
>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
<Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
...
If you wonder why the namespace removal procedure isn't always called by default, instead of having to call it manually, this is because of two reasons, which, in order of relevance, are:
1. Removing namespaces requires iterating over and modifying all nodes in the document, which is a reasonably expensive
operation to perform for all documents crawled by Scrapy
2. There could be some cases where using namespaces is actually required, in case some element names clash
between namespaces. These cases are very rare though.
3.4 Items
The main goal in scraping is to extract structured data from unstructured sources, typically, web pages. Scrapy spiders
can return the extracted data as Python dicts. While convenient and familiar, Python dicts lack structure: it is easy to
make a typo in a field name or return inconsistent data, especially in a larger project with many spiders.
To define common output data format Scrapy provides the Item class. Item objects are simple containers used to
collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available
fields.
Various Scrapy components use extra information provided by Items: exporters look at declared fields to figure out
columns to export, serialization can be customized using Item fields metadata, trackref tracks Item instances to
help finding memory leaks (see Debugging memory leaks with trackref ), etc.
Note: Those familiar with Django will notice that Scrapy Items are declared similarly to Django Models, except that
Scrapy Items are much simpler as there is no concept of different field types.
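For example, a Product item with a few declared fields might look like this (a sketch matching the Product example referenced later in this chapter):

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)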
You can specify any kind of metadata for each field. There is no restriction on the values accepted by Field objects.
For this same reason, there is no reference list of all available metadata keys. Each key defined in Field objects
could be used by different components, and only those components know about it. You can also define and use any
other Field key in your project, for your own needs. The main goal of Field objects is to provide a way to
define all field metadata in one place. Typically, those components whose behaviour depends on each field use certain
field keys to configure that behaviour. You must refer to their documentation to see which metadata keys are used by
each component.
It's important to note that the Field objects used to declare the item do not stay assigned as class attributes. Instead,
they can be accessed through the Item.fields attribute.
You can also extend field metadata by using the previous field metadata and appending more values, or changing
existing values, like this:
class SpecificProduct(Product):
    name = scrapy.Field(Product.fields['name'], serializer=my_serializer)
That adds (or replaces) the serializer metadata key for the name field, keeping all the previously existing metadata values.
3.5 Item Loaders
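Item Loaders provide a convenient mechanism for populating scraped items. The sketch below reconstructs the kind of loader usage the following paragraphs walk through (the Product item and the XPath/CSS expressions are the ones discussed next):

from scrapy.loader import ItemLoader
from myproject.items import Product

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today')  # you can also use literal values
    return l.load_item()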
By quickly looking at that code, we can see the name field is being extracted from two different XPath locations in
the page:
1. //div[@class="product_name"]
2. //div[@class="product_title"]
In other words, data is being collected by extracting it from two XPath locations, using the add_xpath() method.
This is the data that will be assigned to the name field later.
Afterwards, similar calls are used for the price and stock fields (the latter using a CSS selector with the add_css()
method), and finally the last_updated field is populated directly with a literal value (today) using a different
method: add_value().
Finally, when all data is collected, the ItemLoader.load_item() method is called which actually returns
the item populated with the data previously extracted and collected with the add_xpath(), add_css(), and
add_value() calls.
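A loader declaring processors might look like this (a sketch; unicode.title, Join and the field names are illustrative choices):

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()

    name_in = MapCompose(unicode.title)
    name_out = Join()

    price_in = MapCompose(unicode.strip)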
As you can see, input processors are declared using the _in suffix while output processors are declared using the _out suffix.
And you can also declare default input/output processors using the ItemLoader.default_input_processor and
ItemLoader.default_output_processor attributes.
The precedence order, for both input and output processors, is as follows:
1. Item Loader field-specific attributes: field_in and field_out (most precedence)
2. Field metadata (input_processor and output_processor key)
3. Item Loader defaults: ItemLoader.default_input_processor() and ItemLoader.default_output_processor() (least precedence)
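A processor can opt in to the Item Loader context by accepting a loader_context argument; a minimal runnable sketch (parse_length and the unit key are the names used in the discussion below):

def parse_length(text, loader_context):
    # The currently active Item Loader context is passed in as loader_context.
    unit = loader_context.get('unit', 'm')
    value = float(text.split()[0])
    # Convert centimetres to metres when the context says so.
    return value / 100.0 if unit == 'cm' else value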
By accepting a loader_context argument the function is explicitly telling the Item Loader that it's able to receive
an Item Loader context, so the Item Loader passes the currently active context when calling it, and the processor
function (parse_length in this case) can thus use it.
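There are several ways to modify the Item Loader context values:
1. By modifying the currently active Item Loader context (the context attribute); a minimal sketch:

loader = ItemLoader(product)
loader.context['unit'] = 'cm'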
2. On Item Loader instantiation (the keyword arguments of Item Loader constructor are stored in the Item Loader
context):
loader = ItemLoader(product, unit='cm')
3. On Item Loader declaration, for those input/output processors that support instantiating them with an Item
Loader context. MapCompose is one of them:
class ProductLoader(ItemLoader):
    length_out = MapCompose(parse_length, unit='cm')
add_value(field_name, value, *processors, **kwargs)
The value is first passed through get_value() by giving the processors and kwargs, and then
passed through the field input processor and its result appended to the data collected for that field. If the
field already contains collected data, the new data is added.
The given field_name can be None, in which case values for multiple fields may be added. And the
processed value should be a dict with field_name mapped to values.
Examples:
loader.add_value('name', u'Color TV')
loader.add_value('colours', [u'white', u'blue'])
loader.add_value('length', u'100')
loader.add_value('name', u'name: foo', TakeFirst(), re='name: (.+)')
loader.add_value(None, {'name': u'foo', 'sex': u'male'})
Examples:
# HTML snippet: <p class="product-name">Color TV</p>
loader.get_css('p.product-name')
# HTML snippet: <p id="price">the price is $1200</p>
loader.get_css('p#price', TakeFirst(), re='the price is (.*)')
selector
The Selector object to extract data from. Its either the selector given in the constructor or one created
from the response given in the constructor using the default_selector_class. This attribute is
meant to be read-only.
Another case where extending Item Loaders can be very helpful is when you have multiple source formats, for example
XML and HTML. In the XML version you may want to remove CDATA occurrences. Here's an example of how to do
it:
from scrapy.loader.processors import MapCompose
from myproject.ItemLoaders import ProductLoader
from myproject.utils.xml import remove_cdata
class XmlProductLoader(ProductLoader):
    name_in = MapCompose(remove_cdata, ProductLoader.name_in)
class scrapy.loader.processors.Identity
The simplest processor, which doesn't do anything. It returns the original values unchanged. It doesn't receive
any constructor arguments, nor does it accept Loader contexts.
Example:
>>> from scrapy.loader.processors import Identity
>>> proc = Identity()
>>> proc(['one', 'two', 'three'])
['one', 'two', 'three']
class scrapy.loader.processors.TakeFirst
Returns the first non-null/non-empty value from the values received, so it's typically used as an output processor
to single-valued fields. It doesn't receive any constructor arguments, nor does it accept Loader contexts.
Example:
>>> from scrapy.loader.processors import TakeFirst
>>> proc = TakeFirst()
>>> proc(['', 'one', 'two', 'three'])
'one'
class scrapy.loader.processors.Join(separator=u' ')
Returns the values joined with the separator given in the constructor, which defaults to u' '. It doesn't accept
Loader contexts.
When using the default separator, this processor is equivalent to the function: u' '.join
Examples:
>>> from scrapy.loader.processors import Join
>>> proc = Join()
>>> proc(['one', 'two', 'three'])
u'one two three'
>>> proc = Join('<br>')
>>> proc(['one', 'two', 'three'])
u'one<br>two<br>three'
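class scrapy.loader.processors.Compose(*functions, **default_loader_context)
A processor built from the composition of the given functions: each input value is passed to the first function, whose result is passed to the second function, and so on, until the last function returns the output value. A minimal sketch of its behaviour (the example values are illustrative):
>>> from scrapy.loader.processors import Compose
>>> proc = Compose(lambda v: v[0], unicode.upper)
>>> proc([u'hello', u'world'])
u'HELLO'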
Each function can optionally receive a loader_context parameter. For those which do, this processor will
pass the currently active Loader context through that parameter.
The keyword arguments passed in the constructor are used as the default Loader context values passed to each
function call. However, the final Loader context values passed to functions are overridden with the currently
active Loader context accessible through the ItemLoader.context() attribute.
class scrapy.loader.processors.MapCompose(*functions, **default_loader_context)
A processor which is constructed from the composition of the given functions, similar to the Compose processor. The difference with this processor is the way internal results are passed among functions, which is as
follows:
The input value of this processor is iterated and the first function is applied to each element. The results of these
function calls (one for each element) are concatenated to construct a new iterable, which is then used to apply
the second function, and so on, until the last function is applied to each value of the list of values collected so
far. The output values of the last function are concatenated together to produce the output of this processor.
Each particular function can return a value or a list of values, which is flattened with the list of values returned
by the same function applied to the other input values. The functions can also return None in which case the
output of that function is ignored for further processing over the chain.
This processor provides a convenient way to compose functions that only work with single values (instead of
iterables). For this reason the MapCompose processor is typically used as input processor, since data is often
extracted using the extract() method of selectors, which returns a list of unicode strings.
The example below should clarify how it works:
>>> def filter_world(x):
...     return None if x == 'world' else x
...
>>> from scrapy.loader.processors import MapCompose
>>> proc = MapCompose(filter_world, unicode.upper)
>>> proc([u'hello', u'world', u'this', u'is', u'scrapy'])
[u'HELLO', u'THIS', u'IS', u'SCRAPY']
As with the Compose processor, functions can receive Loader contexts, and constructor keyword arguments are
used as default context values. See Compose processor for more info.
class scrapy.loader.processors.SelectJmes(json_path)
Queries the value using the json path provided to the constructor and returns the output. Requires jmespath
(https://github.com/jmespath/jmespath.py) to run. This processor takes only one input at a time.
Example:
>>> from scrapy.loader.processors import SelectJmes, Compose, MapCompose
>>> proc = SelectJmes("foo") #for direct use on lists and dictionaries
>>> proc({'foo': 'bar'})
'bar'
>>> proc({'foo': {'bar': 'baz'}})
{'bar': 'baz'}
3.6 Scrapy shell
The shell is used for testing XPath or CSS expressions and seeing how they work and what data they extract from the web
pages you're trying to scrape. It allows you to interactively test your expressions while you're writing your spider,
without having to run the spider to test every change.
Once you get familiar with the Scrapy shell, you'll see that it's an invaluable tool for developing and debugging
your spiders.
If you have IPython installed, the Scrapy shell will use it (instead of the standard Python console). The IPython console
is much more powerful and provides smart auto-completion and colorized output, among other things.
We highly recommend you install IPython, especially if you're working on Unix systems (where IPython excels). See
the IPython installation guide for more info.
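First, you open the shell by passing the URL you want to scrape to the shell command (the URL here matches the session shown below):

scrapy shell 'http://scrapy.org'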
Then, the shell fetches the URL (using the Scrapy downloader) and prints the list of available objects and useful
shortcuts (you'll notice that these lines all start with the [s] prefix):
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
[s]   item       {}
[s]   request    <GET http://scrapy.org>
[s]   response   <200 http://scrapy.org>
[s]   settings   <scrapy.settings.Settings object at 0x2bfd650>
[s]   spider     <Spider 'default' at 0x20c6f50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>
Invoking the shell from spiders to inspect responses
Sometimes you want to inspect the responses that are being processed at a certain point of your spider, if only to check that a response you expect is getting there. This can be achieved by using the scrapy.shell.inspect_response function. Here's an example of how you would call it from your spider:
class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

        # Rest of parsing code.
When you run the spider, you will get something similar to this:
2014-01-23 17:48:31-0400 [scrapy] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
2014-01-23 17:48:31-0400 [scrapy] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
...
>>> response.url
'http://example.org'
Then, you can check if the extraction code is working:
>>> response.xpath('//h1[@class="fn"]')
[]
Nope, it doesn't. So you can open the response in your web browser and see if it's the response you were expecting:
>>> view(response)
True
Finally you hit Ctrl-D (or Ctrl-Z in Windows) to exit the shell and resume the crawling:
>>> ^D
2014-01-23 17:50:03-0400 [scrapy] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
...
Note that you can't use the fetch shortcut here since the Scrapy engine is blocked by the shell. However, after you
leave the shell, the spider will continue crawling where it stopped, as shown above.
3.7 Item Pipeline
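Each item pipeline component is a Python class that implements a process_item(item, spider) method. For reference, a minimal JsonWriterPipeline of the kind the note below refers to might look like this (a sketch; the items.jl filename is illustrative):

import json

class JsonWriterPipeline(object):

    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item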
Note: The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store
all scraped items into a JSON file you should use the Feed exports.
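Write items to MongoDB: this example stores items in MongoDB using pymongo, reading the connection parameters from the Scrapy settings. The class header and constructor below are a reconstruction (the collection name is illustrative); the remaining methods follow:

import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db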
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert(dict(item))
        return item
Duplicates filter
A filter that looks for duplicate items, and drops those items that were already processed. Let's say that our items have
a unique id, but our spider returns multiple items with the same id:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
The integer values you assign to classes in the ITEM_PIPELINES setting determine the order in which they run: items go through from
lower valued to higher valued classes. It's customary to define these numbers in the 0-1000 range.
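For instance, a sketch of enabling the pipelines from this chapter in settings.py (the module path is illustrative):

ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,
    'myproject.pipelines.DuplicatesPipeline': 800,
}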
3.8 Feed exports
One of the most frequently required features when implementing scrapers is being able to store the scraped data
properly and, quite often, that means generating an export file with the scraped data (commonly called an "export
feed") to be consumed by other systems.
Scrapy provides this functionality out of the box with the Feed Exports, which allows you to generate a feed with the
scraped items, using multiple serialization formats and storage backends.
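For example, a minimal configuration in the project settings might be (the output path is illustrative):

FEED_URI = 'file:///tmp/export.jl'
FEED_FORMAT = 'jsonlines'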
3.8.1 Serialization formats
Marshal
FEED_FORMAT: marshal
Exporter used: MarshalItemExporter
3.8.2 Storages
When using the feed exports you define where to store the feed using a URI (through the FEED_URI setting). The
feed exports support multiple storage backend types which are defined by the URI scheme.
The storage backends supported out of the box are:
Local filesystem
FTP
S3 (requires boto)
Standard output
Some storage backends may be unavailable if the required external libraries are not available. For example, the S3
backend is only available if the boto library is installed.
FTP
The feeds are stored on an FTP server.
URI scheme: ftp
Example URI: ftp://user:[email protected]/path/to/export.csv
Required external libraries: none
S3
The feeds are stored on Amazon S3.
URI scheme: s3
Example URIs:
s3://mybucket/path/to/export.csv
s3://aws_key:aws_secret@mybucket/path/to/export.csv
Required external libraries: boto
The AWS credentials can be passed as user/password in the URI, or they can be passed through the following settings:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
Standard output
The feeds are written to the standard output of the Scrapy process.
URI scheme: stdout
Example URI: stdout:
Required external libraries: none
3.8.5 Settings
These are the settings used for configuring the feed exports:
FEED_URI (mandatory)
FEED_FORMAT
FEED_STORAGES
FEED_EXPORTERS
FEED_STORE_EMPTY
FEED_EXPORT_FIELDS
FEED_URI
Default: None
The URI of the export feed. See Storage backends for supported URI schemes.
This setting is required for enabling the feed exports.
FEED_FORMAT
The serialization format to be used for the feed. See Serialization formats for possible values.
FEED_EXPORT_FIELDS
Default: None
A list of fields to export, optional. Example: FEED_EXPORT_FIELDS = ["foo", "bar", "baz"].
Use FEED_EXPORT_FIELDS option to define fields to export and their order.
When FEED_EXPORT_FIELDS is empty or None (default), Scrapy uses fields defined in dicts or Item subclasses a
spider is yielding.
If an exporter requires a fixed set of fields (this is the case for CSV export format) and FEED_EXPORT_FIELDS is
empty or None, then Scrapy tries to infer field names from the exported data - currently it uses field names from the
first item.
FEED_STORE_EMPTY
Default: False
Whether to export empty feeds (ie. feeds with no items).
FEED_STORAGES
Default: {}
A dict containing additional feed storage backends supported by your project. The keys are URI schemes and the
values are paths to storage classes.
FEED_STORAGES_BASE
Default:
{
'': 'scrapy.extensions.feedexport.FileFeedStorage',
'file': 'scrapy.extensions.feedexport.FileFeedStorage',
'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
's3': 'scrapy.extensions.feedexport.S3FeedStorage',
'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}
FEED_EXPORTERS
Default: {}
A dict containing additional exporters supported by your project. The keys are URI schemes and the values are paths
to Item exporter classes.
FEED_EXPORTERS_BASE
Default:
FEED_EXPORTERS_BASE = {
'json': 'scrapy.exporters.JsonItemExporter',
'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
'csv': 'scrapy.exporters.CsvItemExporter',
'xml': 'scrapy.exporters.XmlItemExporter',
'marshal': 'scrapy.exporters.MarshalItemExporter',
}
3.9 Requests and Responses
Request objects
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback ])
A Request object represents an HTTP request, which is usually generated in the Spider and executed by the Downloader, and thus generating a Response.
Parameters
body (str or unicode) the request body. If a unicode is passed, then it's encoded to str using the
encoding passed (which defaults to utf-8). If body is not given, an empty string is stored.
Regardless of the type of this argument, the final value stored will be a str
(never unicode or None).
headers (dict) the headers of this request. The dict values can be strings (for single
valued headers) or lists (for multi-valued headers). If None is passed as value, the HTTP
header will not be sent at all.
cookies (dict or list) the request cookies. These can be sent in two forms.
1. Using a dict:
request_with_cookies = Request(url="http://www.example.com",
cookies={'currency': 'USD', 'country': 'UY'})
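2. Using a list of dicts (a reconstructed sketch), which allows specifying additional cookie attributes:
request_with_cookies = Request(url="http://www.example.com",
                               cookies=[{'name': 'currency',
                                         'value': 'USD',
                                         'domain': 'example.com',
                                         'path': '/currency'}])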
The latter form allows for customizing the domain and path attributes of the cookie. This
is only useful if the cookies are saved for later requests.
When some site returns cookies (in a response) those are stored in the cookies for that
domain and will be sent again in future requests. That's the typical behaviour of any regular
web browser. However, if, for some reason, you want to avoid merging with existing cookies
you can instruct Scrapy to do so by setting the dont_merge_cookies key to True in the
Request.meta.
Example of request without merging cookies:
request_with_cookies = Request(url="http://www.example.com",
cookies={'currency': 'USD', 'country': 'UY'},
meta={'dont_merge_cookies': True})
method
A string representing the HTTP method in the request. This is guaranteed to be uppercase. Example:
"GET", "POST", "PUT", etc
headers
A dictionary-like object which contains the request headers.
body
A str that contains the request body.
This attribute is read-only. To change the body of a Request use replace().
meta
A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually
populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this
dict depends on the extensions you have enabled.
See Request.meta special keys for a list of special meta keys recognized by Scrapy.
This dict is shallow copied when the request is cloned using the copy() or replace() methods, and
can also be accessed, in your spider, from the response.meta attribute.
copy()
Return a new Request which is a copy of this Request. See also: Passing additional data to callback
functions.
replace([url, method, headers, body, cookies, meta, encoding, dont_filter, callback, errback ])
Return a Request object with the same members, except for those members given new values by whichever
keyword arguments are specified. The attribute Request.meta is copied by default (unless a new value
is given in the meta argument). See also Passing additional data to callback functions.
Passing additional data to callback functions
The callback of a request is a function that will be called when the response of that request is downloaded. The
callback function will be called with the downloaded Response object as its first argument.
Example:
def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)
In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments
later, in the second callback. You can use the Request.meta attribute for that.
Heres an example of how to pass an item using this mechanism, to populate different fields from different pages:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
FormRequest objects
The FormRequest class extends the base Request with functionality for dealing with HTML forms.
class scrapy.http.FormRequest(url[, formdata, ... ])
The FormRequest class adds a new argument to the constructor. The remaining arguments are the same as for the Request class and are not documented here.
Parameters
formdata (dict or iterable of tuples) is a dictionary (or iterable of (key, value)
tuples) containing HTML Form data which will be url-encoded and assigned to the body of the
request.
The FormRequest objects support the following class method in addition to the standard Request methods:
classmethod from_response(response[, formname=None, formnumber=0, formdata=None, formxpath=None, clickdata=None, dont_click=False, ... ])
Returns a new FormRequest object with its form field values pre-populated with those found in
the HTML <form> element contained in the given response. For an example see Using FormRequest.from_response() to simulate a user login.
The policy is to automatically simulate a click, by default, on any form control that looks clickable, like
an <input type="submit">. Even though this is quite convenient, and often the desired behaviour,
sometimes it can cause problems which could be hard to debug. For example, when working with forms
that are filled and/or submitted using JavaScript, the default from_response() behaviour may not be
the most appropriate. To disable this behaviour you can set the dont_click argument to True. Also, if
you want to change the control clicked (instead of disabling it) you can use the clickdata argument.
Parameters
response (Response object) the response containing a HTML form which will be
used to pre-populate the form fields
formname (string) if given, the form with name attribute set to this value will be used.
formxpath (string) if given, the first form that matches the xpath will be used.
formnumber (integer) the number of form to use, when the response contains multiple
forms. The first one (and also the default) is 0.
formdata (dict) fields to override in the form data. If a field was already present in the
response <form> element, its value is overridden by the one passed in this parameter.
clickdata (dict) attributes to lookup the control clicked. If it's not given, the form
data will be submitted simulating a click on the first clickable element. In addition to html
attributes, the control can be identified by its zero-based index relative to other submittable
inputs inside the form, via the nr attribute.
dont_click (boolean) If True, the form data will be submitted without clicking in
any element.
The other parameters of this class method are passed directly to the FormRequest constructor.
New in version 0.10.3: The formname parameter.
New in version 0.17: The formxpath parameter.
Request usage examples
Using FormRequest to send data via HTTP POST
If you want to simulate an HTML form POST in your spider and send a couple of key-value fields, you can return a
FormRequest object (from your spider) like this:
return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
Using FormRequest.from_response() to simulate a user login
It is usual for websites to provide pre-populated form fields through <input type="hidden"> elements, such
as session related data or authentication tokens (for login pages). When scraping, you'll want these fields to be
automatically pre-populated and only override a couple of them, such as the user name and password. You can use the
FormRequest.from_response() method for this job. Here's an example spider which uses it:
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login succeed before going on
        if "authentication failed" in response.body:
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...
Response objects
class scrapy.http.Response(url[, status=200, headers, body, flags ])
A Response object represents an HTTP response, which is usually downloaded (by the Downloader) and fed to the Spiders for processing.
headers
A dictionary-like object which contains the response headers.
body
A str containing the body of this Response. Keep in mind that Response.body is always a str. If you want
the unicode version use TextResponse.body_as_unicode() (only available in TextResponse
and subclasses).
This attribute is read-only. To change the body of a Response use replace().
request
The Request object that generated this response. This attribute is assigned in the Scrapy engine, after
the response and the request have passed through all Downloader Middlewares. In particular, this means
that:
HTTP redirections will cause the original request (to the URL before redirection) to be assigned to
the redirected response (with the final URL after redirection).
Response.request.url doesn't always equal Response.url
This attribute is only available in the spider code, and in the Spider Middlewares, but not in Downloader Middlewares (although you have the Request available there by other means) and handlers of
the response_downloaded signal.
meta
A shortcut to the Request.meta attribute of the Response.request object (ie. self.request.meta).
Unlike the Response.request attribute, the Response.meta attribute is propagated along redirects
and retries, so you will get the original Request.meta sent from your spider.
See also:
Request.meta attribute
flags
A list that contains flags for this response. Flags are labels used for tagging Responses. For example:
cached, redirected, etc. And they're shown on the string representation of the Response (__str__
method) which is used by the engine for logging.
copy()
Returns a new Response which is a copy of this Response.
replace([url, status, headers, body, request, flags, cls ])
Returns a Response object with the same members, except for those members given new values by
whichever keyword arguments are specified. The attribute Response.meta is copied by default.
urljoin(url)
Constructs an absolute url by combining the Response's url with a possible relative url.
This is a wrapper over urlparse.urljoin; it's merely an alias for making this call:
urlparse.urljoin(response.url, url)
TextResponse objects
class scrapy.http.TextResponse(url[, encoding[, ... ]])
TextResponse objects add encoding capabilities to the base Response class, which is meant to be used
only for binary data, such as images, sounds or any media file.
TextResponse objects support a new constructor argument, in addition to the base Response objects. The
remaining functionality is the same as for the Response class and is not documented here.
Parameters encoding (string) is a string which contains the encoding to use for this response.
If you create a TextResponse object with a unicode body, it will be encoded using this
encoding (remember the body attribute is always a string). If encoding is None (default
value), the encoding will be looked up in the response headers and body instead.
TextResponse objects support the following attributes in addition to the standard Response ones:
encoding
A string with the encoding of this response. The encoding is resolved by trying the following mechanisms,
in order:
1.the encoding passed in the constructor encoding argument
2.the encoding declared in the Content-Type HTTP header. If this encoding is not valid (ie. unknown),
it is ignored and the next resolution mechanism is tried.
3.the encoding declared in the response body. The TextResponse class doesn't provide any special
functionality for this. However, the HtmlResponse and XmlResponse classes do.
4.the encoding inferred by looking at the response body. This is the more fragile method but also the
last one tried.
selector
A Selector instance using the response as target. The selector is lazily instantiated on first access.
TextResponse objects support the following methods in addition to the standard Response ones:
body_as_unicode()
Returns the body of the response as unicode. This is equivalent to:
response.body.decode(response.encoding)
But it is not equivalent to unicode(response.body), since, in that case, you would be using the system
default encoding (typically ascii) to convert the body to unicode, instead of the response encoding.
xpath(query)
A shortcut to TextResponse.selector.xpath(query):
response.xpath('//p')
css(query)
A shortcut to TextResponse.selector.css(query):
response.css('p')
HtmlResponse objects
class scrapy.http.HtmlResponse(url[, ... ])
The HtmlResponse class is a subclass of TextResponse which adds encoding auto-discovering support
by looking into the HTML meta http-equiv attribute. See TextResponse.encoding.
XmlResponse objects
class scrapy.http.XmlResponse(url[, ... ])
The XmlResponse class is a subclass of TextResponse which adds encoding auto-discovering support by
looking into the XML declaration line. See TextResponse.encoding.
3.10 Link Extractors
Link extractors are objects whose only purpose is to extract links from web pages, which will eventually be followed. There used to be other link extractor classes in previous Scrapy versions, but they are deprecated now.
LxmlLinkExtractor
class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)
LxmlLinkExtractor is the recommended link extractor with handy filtering options. It is implemented using
lxml's robust HTMLParser.
Parameters
allow (a regular expression (or list of)) a single regular expression (or list of regular
expressions) that the (absolute) urls must match in order to be extracted. If not given (or
empty), it will match all links.
deny (a regular expression (or list of)) a single regular expression (or list of regular
expressions) that the (absolute) urls must match in order to be excluded (ie. not extracted).
It has precedence over the allow parameter. If not given (or empty) it won't exclude any
links.
allow_domains (str or list) a single value or a list of strings containing domains which
will be considered for extracting the links
deny_domains (str or list) a single value or a list of strings containing domains which
wont be considered for extracting the links
deny_extensions (list) a single value or list of strings containing extensions
that should be ignored when extracting links. If not given, it will default to the
IGNORED_EXTENSIONS list defined in the scrapy.linkextractors package.
restrict_xpaths (str or list) is an XPath (or list of XPaths) which defines regions
inside the response where links should be extracted from. If given, only the text selected by
those XPath will be scanned for links. See examples below.
restrict_css (str or list) a CSS selector (or list of selectors) which defines regions
inside the response where links should be extracted from. Has the same behaviour as
restrict_xpaths.
tags (str or list) a tag or a list of tags to consider when extracting links. Defaults to
(a, area).
attrs (list) an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults
to (href,)
canonicalize (boolean) canonicalize each extracted url (using scrapy.utils.url.canonicalize_url). Defaults to True.
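As a usage sketch (the allow pattern and the restricting XPath are illustrative), a link extractor is typically applied to a response with extract_links():

from scrapy.linkextractors import LinkExtractor

extractor = LinkExtractor(allow=r'/category/', restrict_xpaths='//div[@id="content"]')
for link in extractor.extract_links(response):
    print link.url, link.text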
3.11 Settings
The Scrapy settings allow you to customize the behaviour of all Scrapy components, including the core, extensions,
pipelines and spiders themselves.
The infrastructure of the settings provides a global namespace of key-value mappings that the code can use to pull
configuration values from. The settings can be populated through different mechanisms, which are described below.
The settings are also the mechanism for selecting the currently active Scrapy project (in case you have many).
For a list of available built-in settings see: Built-in settings reference.
2. Settings per-spider
Spiders (See the Spiders chapter for reference) can define their own settings that will take precedence and override the
project ones. They can do so by setting their scrapy.spiders.Spider.custom_settings attribute.
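For example, a sketch of a spider overriding a couple of settings (the values are illustrative):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'DOWNLOAD_DELAY': 2.0,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 4,
    }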
3. Project settings module
The project settings module is the standard configuration file for your Scrapy project. It's where most of your custom
settings will be populated. For example: myproject.settings.
In other words, settings can be accessed like a dict, but it's usually preferred to extract the setting in the format you
need it to avoid type errors. In order to do that you'll have to use one of the methods provided by the Settings API.
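For instance, a sketch of pulling typed values through the Settings API, from a component that has access to the crawler (the settings used here are documented below):

settings = crawler.settings
if settings.getbool('LOG_ENABLED'):
    delay = settings.getfloat('DOWNLOAD_DELAY')
    concurrency = settings.getint('CONCURRENT_REQUESTS')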
Built-in settings reference
AWS_SECRET_ACCESS_KEY
Default: None
The AWS secret key used by code that requires access to Amazon Web Services, such as the S3 feed storage backend.
BOT_NAME
Default: scrapybot
The name of the bot implemented by this Scrapy project (also known as the project name). This will be used to
construct the User-Agent by default, and also for logging.
It's automatically populated with your project name when you create your project with the startproject command.
CONCURRENT_ITEMS
Default: 100
Maximum number of concurrent items (per response) to process in parallel in the Item Processor (also known as the
Item Pipeline).
CONCURRENT_REQUESTS
Default: 16
The maximum number of concurrent (ie. simultaneous) requests that will be performed by the Scrapy downloader.
CONCURRENT_REQUESTS_PER_DOMAIN
Default: 8
The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single domain.
CONCURRENT_REQUESTS_PER_IP
Default: 0
The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single IP. If nonzero, the CONCURRENT_REQUESTS_PER_DOMAIN setting is ignored, and this one is used instead. In other words,
concurrency limits will be applied per IP, not per domain.
This setting also affects DOWNLOAD_DELAY: if CONCURRENT_REQUESTS_PER_IP is non-zero, download delay
is enforced per IP, not per domain.
DEFAULT_ITEM_CLASS
Default: scrapy.item.Item
The default class that will be used for instantiating items in the Scrapy shell.
DEFAULT_REQUEST_HEADERS
Default:
{
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
The default headers used for Scrapy HTTP Requests. They're populated in the DefaultHeadersMiddleware.
DEPTH_LIMIT
Default: 0
The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.
DEPTH_PRIORITY
Default: 0
An integer that is used to adjust the request priority based on its depth.
If zero, no priority adjustment is made from depth.
DEPTH_STATS
Default: True
Whether to collect maximum depth stats.
DEPTH_STATS_VERBOSE
Default: False
Whether to collect verbose depth stats. If this is enabled, the number of requests for each depth is collected in the
stats.
DNSCACHE_ENABLED
Default: True
Whether to enable DNS in-memory cache.
DNSCACHE_SIZE
Default: 10000
DNS in-memory cache size.
DNS_TIMEOUT
Default: 60
Timeout for processing of DNS queries in seconds. Float is supported.
DOWNLOADER
Default: scrapy.core.downloader.Downloader
The downloader to use for crawling.
DOWNLOADER_MIDDLEWARES
Default: {}
A dict containing the downloader middlewares enabled in your project, and their orders. For more info see Activating
a downloader middleware.
DOWNLOADER_MIDDLEWARES_BASE
Default:
{
'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 500,
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 550,
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware': 830,
'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}
A dict containing the downloader middlewares enabled by default in Scrapy. You should never modify this setting in
your project, modify DOWNLOADER_MIDDLEWARES instead. For more info see Activating a downloader middleware.
DOWNLOADER_STATS
Default: True
Whether to enable downloader stats collection.
DOWNLOAD_DELAY
Default: 0
The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same
website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are
supported. Example:
DOWNLOAD_DELAY = 0.25    # 250 ms of delay
This setting is also affected by the RANDOMIZE_DOWNLOAD_DELAY setting (which is enabled by default). By
default, Scrapy doesn't wait a fixed amount of time between requests, but uses a random interval between 0.5 and 1.5
* DOWNLOAD_DELAY.
When CONCURRENT_REQUESTS_PER_IP is non-zero, delays are enforced per ip address instead of per domain.
You can also change this setting per spider by setting download_delay spider attribute.
DOWNLOAD_HANDLERS
Default: {}
A dict containing the request downloader handlers enabled in your project. See DOWNLOAD_HANDLERS_BASE for
example format.
DOWNLOAD_HANDLERS_BASE
Default:
{
'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
'http': 'scrapy.core.downloader.handlers.http.HttpDownloadHandler',
'https': 'scrapy.core.downloader.handlers.http.HttpDownloadHandler',
's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
}
A dict containing the request download handlers enabled by default in Scrapy. You should never modify this setting in
your project, modify DOWNLOAD_HANDLERS instead.
If you want to disable any of the above download handlers you must define them in your project's
DOWNLOAD_HANDLERS setting and assign None as their value. For example, if you want to disable the file download
handler:
handler:
DOWNLOAD_HANDLERS = {
    'file': None,
}
DOWNLOAD_TIMEOUT
Default: 180
The amount of time (in secs) that the downloader will wait before timing out.
Note: This timeout can be set per spider using download_timeout spider attribute and per-request using
download_timeout Request.meta key.
DOWNLOAD_MAXSIZE
Default: 1073741824 (1024MB)
The maximum response size (in bytes) that downloader will download.
If you want to disable it set to 0.
Note:
This size can be set per spider using download_maxsize spider attribute and per-request using
download_maxsize Request.meta key.
This feature needs Twisted >= 11.1.
DOWNLOAD_WARNSIZE
Default: 33554432 (32MB)
The response size (in bytes) at which the downloader will start to warn.
If you want to disable it set to 0.
Note:
This size can be set per spider using download_warnsize spider attribute and per-request using
download_warnsize Request.meta key.
This feature needs Twisted >= 11.1.
DUPEFILTER_CLASS
Default: scrapy.dupefilters.RFPDupeFilter
The class used to detect and filter duplicate requests.
The default (RFPDupeFilter) filters based on request fingerprint using the
scrapy.utils.request.request_fingerprint function. In order to change the way duplicates are
checked you could subclass RFPDupeFilter and override its request_fingerprint method. This method
should accept a scrapy Request object and return its fingerprint (a string).
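A sketch of such a subclass (fingerprinting on the URL alone is just an illustration):

from scrapy.dupefilters import RFPDupeFilter

class URLOnlyDupeFilter(RFPDupeFilter):
    def request_fingerprint(self, request):
        # Treat two requests as duplicates whenever their URLs match exactly.
        return request.url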
DUPEFILTER_DEBUG
Default: False
By default, RFPDupeFilter only logs the first duplicate request. Setting DUPEFILTER_DEBUG to True will
make it log all duplicate requests.
EDITOR
Default: depends on the environment
The editor to use for editing spiders with the edit command. It defaults to the EDITOR environment variable, if set.
Otherwise, it defaults to vi (on Unix systems) or the IDLE editor (on Windows).
EXTENSIONS
Default: {}
A dict containing the extensions enabled in your project, and their orders.
EXTENSIONS_BASE
Default:
{
'scrapy.extensions.corestats.CoreStats': 0,
'scrapy.telnet.TelnetConsole': 0,
'scrapy.extensions.memusage.MemoryUsage': 0,
'scrapy.extensions.memdebug.MemoryDebugger': 0,
'scrapy.extensions.closespider.CloseSpider': 0,
'scrapy.extensions.feedexport.FeedExporter': 0,
'scrapy.extensions.logstats.LogStats': 0,
'scrapy.extensions.spiderstate.SpiderState': 0,
'scrapy.extensions.throttle.AutoThrottle': 0,
}
The list of available extensions. Keep in mind that some of them need to be enabled through a setting. By default, this
setting contains all stable built-in extensions.
For more information see the extensions user guide and the list of available extensions.
ITEM_PIPELINES
Default: {}
A dict containing the item pipelines to use, and their orders. The dict is empty by default. Order values are arbitrary, but
it's customary to define them in the 0-1000 range.
Lists are supported in ITEM_PIPELINES for backwards compatibility, but they are deprecated.
Example:
ITEM_PIPELINES = {
    'mybot.pipelines.validate.ValidateMyItem': 300,
    'mybot.pipelines.validate.StoreMyItem': 800,
}
ITEM_PIPELINES_BASE
Default: {}
A dict containing the pipelines enabled by default in Scrapy. You should never modify this setting in your project,
modify ITEM_PIPELINES instead.
LOG_ENABLED
Default: True
Whether to enable logging.
LOG_ENCODING
Default: utf-8
The encoding to use for logging.
LOG_FILE
Default: None
File name to use for logging output. If None, standard error will be used.
LOG_FORMAT
Default: %(asctime)s [%(name)s] %(levelname)s: %(message)s
String for formatting log messages. Refer to the Python logging documentation for the whole list of available placeholders.
LOG_DATEFORMAT
Default: %Y-%m-%d %H:%M:%S
String for formatting date/time, expansion of the %(asctime)s placeholder in LOG_FORMAT. Refer to the Python
datetime documentation for the whole list of available directives.
LOG_LEVEL
Default: DEBUG
Minimum level to log. Available levels are: CRITICAL, ERROR, WARNING, INFO, DEBUG. For more info see
Logging.
LOG_STDOUT
Default: False
If True, all standard output (and error) of your process will be redirected to the log. For example if you print
'hello' it will appear in the Scrapy log.
MEMDEBUG_ENABLED
Default: False
Whether to enable memory debugging.
MEMDEBUG_NOTIFY
Default: []
When memory debugging is enabled a memory report will be sent to the specified addresses if this setting is not empty,
otherwise the report will be written to the log.
Example:
MEMDEBUG_NOTIFY = ['[email protected]']
MEMUSAGE_ENABLED
Default: False
Scope: scrapy.extensions.memusage
Whether to enable the memory usage extension that will shut down the Scrapy process when it exceeds a memory limit,
and also notify by email when that happens.
See Memory usage extension.
MEMUSAGE_LIMIT_MB
Default: 0
Scope: scrapy.extensions.memusage
The maximum amount of memory to allow (in megabytes) before shutting down Scrapy (if MEMUSAGE_ENABLED
is True). If zero, no check will be performed.
See Memory usage extension.
MEMUSAGE_NOTIFY_MAIL
Default: False
Scope: scrapy.extensions.memusage
A list of emails to notify if the memory limit has been reached.
Example:
MEMUSAGE_NOTIFY_MAIL = ['[email protected]']
MEMUSAGE_WARNING_MB
Default: 0
Scope: scrapy.extensions.memusage
The maximum amount of memory to allow (in megabytes) before sending a warning email notifying about it. If zero,
no warning will be produced.
NEWSPIDER_MODULE
Default: '' (empty string)
Module where to create new spiders using the genspider command.
Example:
NEWSPIDER_MODULE = 'mybot.spiders_dev'
RANDOMIZE_DOWNLOAD_DELAY
Default: True
If enabled, Scrapy will wait a random amount of time (between 0.5 and 1.5 * DOWNLOAD_DELAY) while fetching
requests from the same website.
This randomization decreases the chance of the crawler being detected (and subsequently blocked) by sites which
analyze requests looking for statistically significant similarities in the time between their requests.
The randomization policy is the same used by wget --random-wait option.
If DOWNLOAD_DELAY is zero (default) this option has no effect.
REACTOR_THREADPOOL_MAXSIZE
Default: 10
The maximum limit for the Twisted Reactor thread pool size. This is a common multi-purpose thread pool used by various
Scrapy components (threaded DNS resolver, BlockingFeedStorage, S3FilesStore, to name a few). Increase this
value if you're experiencing problems with insufficient blocking IO.
REDIRECT_MAX_TIMES
Default: 20
Defines the maximum number of times a request can be redirected. After this maximum, the request's response is returned as is.
We used Firefox's default value for the same task.
REDIRECT_MAX_METAREFRESH_DELAY
Default: 100
Some sites use meta-refresh for redirecting to a session expired page, so we restrict automatic redirection to a maximum delay (in seconds)
REDIRECT_PRIORITY_ADJUST
Default: +2
Adjust redirect request priority relative to original request. A negative priority adjust means more priority.
ROBOTSTXT_OBEY
Default: False
Scope: scrapy.downloadermiddlewares.robotstxt
If enabled, Scrapy will respect robots.txt policies. For more information see RobotsTxtMiddleware
SCHEDULER
Default: scrapy.core.scheduler.Scheduler
The scheduler to use for crawling.
SPIDER_CONTRACTS
Default: {}
A dict containing the scrapy contracts enabled in your project, used for testing spiders. For more info see Spiders
Contracts.
SPIDER_CONTRACTS_BASE
Default:
{
'scrapy.contracts.default.UrlContract' : 1,
'scrapy.contracts.default.ReturnsContract': 2,
'scrapy.contracts.default.ScrapesContract': 3,
}
A dict containing the scrapy contracts enabled by default in Scrapy. You should never modify this setting in your
project, modify SPIDER_CONTRACTS instead. For more info see Spiders Contracts.
SPIDER_LOADER_CLASS
Default: scrapy.spiderloader.SpiderLoader
The class that will be used for loading spiders, which must implement the SpiderLoader API.
SPIDER_MIDDLEWARES
Default: {}
A dict containing the spider middlewares enabled in your project, and their orders. For more info see Activating a
spider middleware.
SPIDER_MIDDLEWARES_BASE
Default:
{
'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
}
A dict containing the spider middlewares enabled by default in Scrapy. You should never modify this setting in your
project, modify SPIDER_MIDDLEWARES instead. For more info see Activating a spider middleware.
SPIDER_MODULES
Default: []
A list of modules where Scrapy will look for spiders.
Example:
SPIDER_MODULES = ['mybot.spiders_prod', 'mybot.spiders_dev']
STATS_CLASS
Default: scrapy.statscollectors.MemoryStatsCollector
The class to use for collecting stats, which must implement the Stats Collector API.
STATS_DUMP
Default: True
Dump the Scrapy stats (to the Scrapy log) once the spider finishes.
For more info see: Stats Collection.
STATSMAILER_RCPTS
Default: [] (empty list)
Send Scrapy stats after spiders finish scraping. See StatsMailer for more info.
TELNETCONSOLE_ENABLED
Default: True
A boolean which specifies if the telnet console will be enabled (provided its extension is also enabled).
TELNETCONSOLE_PORT
Default: [6023, 6073]
The port range to use for the telnet console. If set to None or 0, a dynamically assigned port is used. For more info
see Telnet Console.
TEMPLATES_DIR
Default: templates dir inside scrapy module
The directory where to look for templates when creating new projects with startproject command.
URLLENGTH_LIMIT
Default: 2083
Scope: spidermiddlewares.urllength
The maximum URL length to allow for crawled URLs. For more information about the default value for this setting
see: http://www.boutell.com/newfaq/misc/urllength.html
USER_AGENT
Default: "Scrapy/VERSION (+http://scrapy.org)"
The default User-Agent to use when crawling, unless overridden.
Settings documented elsewhere:
The following settings are documented elsewhere; please check each specific case to see how to enable and use them.
AJAXCRAWL_ENABLED
AUTOTHROTTLE_DEBUG
AUTOTHROTTLE_ENABLED
AUTOTHROTTLE_MAX_DELAY
AUTOTHROTTLE_START_DELAY
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
BOT_NAME
CLOSESPIDER_ERRORCOUNT
CLOSESPIDER_ITEMCOUNT
CLOSESPIDER_PAGECOUNT
CLOSESPIDER_TIMEOUT
COMMANDS_MODULE
COMPRESSION_ENABLED
CONCURRENT_ITEMS
CONCURRENT_REQUESTS
CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS_PER_IP
COOKIES_DEBUG
COOKIES_ENABLED
DEFAULT_ITEM_CLASS
DEFAULT_REQUEST_HEADERS
DEPTH_LIMIT
DEPTH_PRIORITY
DEPTH_STATS
DEPTH_STATS_VERBOSE
DNSCACHE_ENABLED
DNSCACHE_SIZE
DNS_TIMEOUT
DOWNLOADER
DOWNLOADER_MIDDLEWARES
DOWNLOADER_MIDDLEWARES_BASE
DOWNLOADER_STATS
DOWNLOAD_DELAY
DOWNLOAD_HANDLERS
DOWNLOAD_HANDLERS_BASE
DOWNLOAD_MAXSIZE
DOWNLOAD_TIMEOUT
DOWNLOAD_WARNSIZE
DUPEFILTER_CLASS
DUPEFILTER_DEBUG
EDITOR
EXTENSIONS
EXTENSIONS_BASE
FEED_EXPORTERS
FEED_EXPORTERS_BASE
FEED_EXPORT_FIELDS
FEED_FORMAT
FEED_STORAGES
FEED_STORAGES_BASE
FEED_STORE_EMPTY
FEED_URI
FILES_EXPIRES
FILES_STORE
HTTPCACHE_DBM_MODULE
HTTPCACHE_DIR
HTTPCACHE_ENABLED
HTTPCACHE_EXPIRATION_SECS
HTTPCACHE_GZIP
HTTPCACHE_IGNORE_HTTP_CODES
HTTPCACHE_IGNORE_MISSING
HTTPCACHE_IGNORE_SCHEMES
HTTPCACHE_POLICY
HTTPCACHE_STORAGE
HTTPERROR_ALLOWED_CODES
HTTPERROR_ALLOW_ALL
IMAGES_EXPIRES
IMAGES_MIN_HEIGHT
IMAGES_MIN_WIDTH
IMAGES_STORE
IMAGES_THUMBS
ITEM_PIPELINES
ITEM_PIPELINES_BASE
LOG_DATEFORMAT
LOG_ENABLED
LOG_ENCODING
LOG_FILE
LOG_FORMAT
LOG_LEVEL
LOG_STDOUT
MAIL_FROM
MAIL_HOST
MAIL_PASS
MAIL_PORT
MAIL_SSL
MAIL_TLS
MAIL_USER
MEMDEBUG_ENABLED
MEMDEBUG_NOTIFY
MEMUSAGE_ENABLED
MEMUSAGE_LIMIT_MB
MEMUSAGE_NOTIFY_MAIL
MEMUSAGE_REPORT
MEMUSAGE_WARNING_MB
METAREFRESH_ENABLED
NEWSPIDER_MODULE
RANDOMIZE_DOWNLOAD_DELAY
REACTOR_THREADPOOL_MAXSIZE
REDIRECT_ENABLED
REDIRECT_MAX_METAREFRESH_DELAY
REDIRECT_MAX_TIMES
REDIRECT_PRIORITY_ADJUST
REFERER_ENABLED
RETRY_ENABLED
RETRY_HTTP_CODES
RETRY_TIMES
ROBOTSTXT_OBEY
SCHEDULER
SPIDER_CONTRACTS
SPIDER_CONTRACTS_BASE
SPIDER_LOADER_CLASS
SPIDER_MIDDLEWARES
SPIDER_MIDDLEWARES_BASE
SPIDER_MODULES
STATSMAILER_RCPTS
STATS_CLASS
STATS_DUMP
TELNETCONSOLE_ENABLED
TELNETCONSOLE_HOST
TELNETCONSOLE_PORT
TEMPLATES_DIR
URLLENGTH_LIMIT
USER_AGENT
3.12 Exceptions
3.12.1 Built-in Exceptions reference
Here's a list of all exceptions included in Scrapy and their usage.
DropItem
exception scrapy.exceptions.DropItem
The exception that must be raised by item pipeline stages to stop processing an Item. For more information see Item
Pipeline.
CloseSpider
exception scrapy.exceptions.CloseSpider(reason='cancelled')
This exception can be raised from a spider callback to request the spider to be closed/stopped. Supported arguments:
Parameters: reason (str) - the reason for closing
For example:
def parse_page(self, response):
    if 'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')
IgnoreRequest
exception scrapy.exceptions.IgnoreRequest
This exception can be raised by the Scheduler or any downloader middleware to indicate that the request should be
ignored.
NotConfigured
exception scrapy.exceptions.NotConfigured
This exception can be raised by some components to indicate that they will remain disabled. Those components
include:
Extensions
Item pipelines
Downloader middlewares
Spider middlewares
The exception must be raised in the component constructor.
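For example, an extension can disable itself from its construction path (such as from_crawler) when a required setting is missing; a minimal sketch, where the extension class and the MYEXT_ENABLED setting are hypothetical:

from scrapy.exceptions import NotConfigured

class MyExtension(object):

    @classmethod
    def from_crawler(cls, crawler):
        # stay disabled unless the (hypothetical) MYEXT_ENABLED setting is on
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured
        return cls()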
NotSupported
exception scrapy.exceptions.NotSupported
This exception is raised to indicate an unsupported feature.
Command line tool Learn about the command-line tool used to manage your Scrapy project.
Spiders Write the rules to crawl your websites.
Selectors Extract the data from web pages using XPath.
Scrapy shell Test your extraction code in an interactive environment.
Items Define the data you want to scrape.
Item Loaders Populate your items with the extracted data.
Item Pipeline Post-process and store your scraped data.
Feed exports Output your scraped data using different formats and storages.
Requests and Responses Understand the classes used to represent HTTP requests and responses.
Link Extractors Convenient classes to extract links to follow from pages.
Settings Learn how to configure Scrapy and see all available settings.
Exceptions See all available exceptions and their meaning.
CHAPTER 4
Built-in services
4.1 Logging
Note: scrapy.log has been deprecated alongside its functions in favor of explicit calls to the Python standard
logging. Keep reading to learn more about the new logging system.
Scrapy uses Python's builtin logging system for event logging. We'll provide some simple examples to get you started,
but for more advanced use-cases it's strongly suggested to read its documentation thoroughly.
Logging works out of the box, and can be configured to some extent with the Scrapy settings listed in Logging settings.
Scrapy calls scrapy.utils.log.configure_logging() to set some reasonable defaults and handle those
settings in Logging settings when running commands, so it's recommended to call it manually if you're running Scrapy
from scripts as described in Run Scrapy from a script.
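A minimal sketch of calling it from a script, before starting any crawl, and then logging through the standard helpers:

import logging
from scrapy.utils.log import configure_logging

# apply Scrapy's default logging configuration manually
configure_logging()
logging.warning("This is a warning")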
There are shortcuts for issuing log messages on any of the standard 5 levels, and there's also a general logging.log
method which takes a given level as an argument. If needed, the last example could be rewritten as:
import logging
logging.log(logging.WARNING, "This is a warning")
On top of that, you can create different loggers to encapsulate messages (for example, a common practice is to
create a different logger for every module). These loggers can be configured independently, and they allow hierarchical
constructions.
The last examples use the root logger behind the scenes, which is a top-level logger to which all messages are propagated
(unless otherwise specified). Using the logging helpers is merely a shortcut for getting the root logger explicitly, so
this is also an equivalent of the last snippets:
import logging
logger = logging.getLogger()
logger.warning("This is a warning")
You can use a different logger just by retrieving it by name with the logging.getLogger function:
import logging
logger = logging.getLogger('mycustomlogger')
logger.warning("This is a warning")
Finally, you can ensure having a custom logger for any module you're working on by using the __name__ variable,
which is populated with the current module's path:
import logging
logger = logging.getLogger(__name__)
logger.warning("This is a warning")
See also:
Module logging, HowTo Basic Logging Tutorial
Module logging, Loggers Further documentation on loggers
That logger is created using the Spider's name, but you can use any custom Python logger you want. For example:
import logging
import scrapy
logger = logging.getLogger('mycustomlogger')
class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://scrapinghub.com']

    def parse(self, response):
        logger.info('Parse function called on %s', response.url)
Refer to Run Scrapy from a script for more details about using Scrapy this way.
The Stats Collector keeps a stats table per open spider which is automatically opened when the spider is opened, and
closed when the spider is closed.
spider_stats
A dict of dicts (keyed by spider name) containing the stats of the last scraping run for each spider.
DummyStatsCollector
class scrapy.statscollectors.DummyStatsCollector
A Stats collector which does nothing but is very efficient (because it does nothing). This stats collector can
be set via the STATS_CLASS setting, to disable stats collection in order to improve performance. However, the
performance penalty of stats collection is usually marginal compared to other Scrapy workloads like parsing
pages.
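For example, disabling stats collection is just a matter of pointing STATS_CLASS at it in your project settings (a minimal sketch):

STATS_CLASS = 'scrapy.statscollectors.DummyStatsCollector'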
Or you can instantiate it by passing a Scrapy settings object, which will respect the settings:
mailer = MailSender.from_settings(settings)
Parameters:
    mailfrom (str) - the address used to send emails (in the From: header). If omitted, the MAIL_FROM setting will be used.
    smtphost (str) - the SMTP host to use for sending the emails. If omitted, the MAIL_HOST setting will be used.
    smtpuser - the SMTP user. If omitted, the MAIL_USER setting will be used. If not given, no SMTP authentication will be performed.
    smtppass (str) - the SMTP pass for authentication.
    smtpport (int) - the SMTP port to connect to.
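Once created (either way), the mailer can send a plain message with the send() method; a quick sketch with illustrative addresses:

mailer.send(to=["someone@example.com"], subject="Some subject", body="Some body",
            cc=["another@example.com"])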
MAIL_PASS
Default: None
Password to use for SMTP authentication, along with MAIL_USER.
MAIL_TLS
Default: False
Enforce using STARTTLS. STARTTLS is a way to take an existing insecure connection, and upgrade it to a secure
connection using SSL/TLS.
MAIL_SSL
Default: False
Enforce connecting using an SSL encrypted connection.
You need the telnet program, which comes installed by default on Windows and most Linux distros.
Shortcut      Description
crawler       the Scrapy Crawler (scrapy.crawler.Crawler object)
engine        Crawler.engine attribute
spider        the active spider
slot          the engine slot
extensions    the Extension Manager (Crawler.extensions attribute)
stats         the Stats Collector (Crawler.stats attribute)
settings      the Scrapy settings object (Crawler.settings attribute)
est           print a report of the engine status
prefs         for memory debugging (see Debugging memory leaks)
p             a shortcut to the pprint.pprint function
hpy           for memory debugging (see Debugging memory leaks)
Example engine status report, as printed by the est() shortcut:

time()-engine.start_time                        : 8.62972998619
engine.has_capacity()                           : False
len(engine.downloader.active)                   : 16
engine.scraper.is_idle()                        : False
engine.spider.name                              : followall
engine.spider_is_idle(engine.spider)            : False
engine.slot.closing                             : False
len(engine.slot.inprogress)                     : 16
len(engine.slot.scheduler.dqs or [])            : 0
len(engine.slot.scheduler.mqs)                  : 92
len(engine.scraper.slot.queue)                  : 0
len(engine.scraper.slot.active)                 : 0
engine.scraper.slot.active_size                 : 0
engine.scraper.slot.itemproc_size               : 0
engine.scraper.slot.needs_backout()             : False
To resume:
telnet localhost 6023
>>> engine.unpause()
>>>
To stop:
telnet localhost 6023
>>> engine.stop()
Connection closed by foreign host.
CHAPTER 5
5.1.13 Why does Scrapy download pages in English instead of my native language?
Try changing the default Accept-Language request header by overriding the DEFAULT_REQUEST_HEADERS setting.
5.1.16 I get Filtered offsite request messages. How can I fix them?
Those messages (logged with DEBUG level) don't necessarily mean there is a problem, so you may not need to fix
them.
Those messages are thrown by the Offsite Spider Middleware, which is a spider middleware (enabled by default) whose
purpose is to filter out requests to domains outside the ones covered by the spider.
For more info see: OffsiteMiddleware.
You can also set a global download delay in your project with the DOWNLOAD_DELAY setting.
5.1.22 Simplest way to dump all my scraped items into a JSON/CSV/XML file?
To dump into a JSON file:
scrapy crawl myspider -o items.json
5.1.23 What's this huge cryptic __VIEWSTATE parameter used in some forms?
The __VIEWSTATE parameter is used in sites built with ASP.NET/VB.NET. For more info on how it works see this
page. Also, here's an example spider which scrapes one of these sites.
5.1.24 What's the best way to parse big XML/CSV data feeds?
Parsing big feeds with XPath selectors can be problematic since they need to build the DOM of the entire feed in
memory, and this can be quite slow and consume a lot of memory.
In order to avoid parsing the entire feed at once in memory, you can use the functions xmliter and csviter
from the scrapy.utils.iterators module. In fact, this is what the feed spiders (see Spiders) use under the hood.
5.1.26 How can I see the cookies being sent and received from Scrapy?
Enable the COOKIES_DEBUG setting.
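For example, it can be switched on for a single run from the command line (the spider name is illustrative):

scrapy crawl myspider -s COOKIES_DEBUG=True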
5.1.30 I'm scraping an XML document and my XPath selector doesn't return any
items
You may need to remove namespaces. See Removing namespaces.
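As a sketch, calling Selector.remove_namespaces() once on the response selector lets the usual namespace-less XPaths work:

# inside a spider callback
response.selector.remove_namespaces()
response.xpath('//link')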
Basically this is a simple spider which parses two pages of items (the start_urls). Items also have a details page with
additional information, so we use the meta functionality of Request to pass a partially populated item.
Using the --verbose or -v option we can see the status at each depth level:
$ scrapy parse --spider=myspider -c parse_item -d 2 -v <item_url>
[ ... scrapy log lines crawling example.com spider ... ]
>>> DEPTH LEVEL: 1 <<<
# Scraped Items -----------------------------------------------------------[]
# Requests ----------------------------------------------------------------[<GET item_details_url>]
-----------------------------------------------------------------
Checking items scraped from a single start_url can also be easily achieved using:
$ scrapy parse --spider=myspider -d 3 'http://example.com/page1'
    else:
        inspect_response(response, self)
open_in_browser will open a browser with the response received by Scrapy at that point, adjusting the base tag
so that images and styles are displayed properly.
5.2.4 Logging
Logging is another useful option for getting information about your spider run. Although not as convenient, it comes
with the advantage that the logs will be available in all future runs should they be necessary again:
def parse_details(self, response):
    item = response.meta.get('item', None)
    if item:
        # populate more `item` fields
        return item
    else:
        self.logger.warning('No item received for %s', response.url)
@returns items 1 16
@returns requests 0 0
@scrapes Title Author Year Price
"""
class scrapy.contracts.default.ReturnsContract
This contract (@returns) sets lower and upper bounds for the items and requests returned by the spider. The
upper bound is optional:
@returns item(s)|request(s) [min [max]]
class scrapy.contracts.default.ScrapesContract
This contract (@scrapes) checks that all the items returned by the callback have the specified fields:
@scrapes field_1 field_2 ...
Each contract must inherit from scrapy.contracts.Contract and can override three methods:
class scrapy.contracts.Contract(method, *args)
Parameters
method (function) callback function to which the contract is associated
args (list) list of arguments passed into the docstring (whitespace separated)
adjust_request_args(args)
This receives a dict as an argument containing default arguments for Request object. Must return the
same or a modified version of it.
pre_process(response)
This allows hooking in various checks on the response received from the sample request, before it's passed to the
callback.
post_process(output)
This allows processing the output of the callback. Iterators are listified before being passed to this hook.
Here is a demo contract which checks the presence of a custom header in the response received. Raise
scrapy.exceptions.ContractFail in order to get the failures pretty printed:
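A sketch of such a contract (the contract name and header are illustrative):

from scrapy.contracts import Contract
from scrapy.exceptions import ContractFail

class HasHeaderContract(Contract):
    """ Demo contract which checks the presence of a custom header
        @has_header X-CustomHeader
    """
    name = 'has_header'

    def pre_process(self, response):
        # fail the contract if any of the given headers is missing
        for header in self.args:
            if header not in response.headers:
                raise ContractFail('X-CustomHeader not present')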
Make sure to check CrawlerProcess documentation to get acquainted with its usage details.
If you are inside a Scrapy project there are some additional helpers you can use to import those components
within the project. You can automatically import your spiders passing their name to CrawlerProcess, and use
get_project_settings to get a Settings instance with your project settings.
What follows is a working example of how to do that, using the testspiders project as an example.
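A sketch along those lines (the 'followall' spider and its domain argument come from the testspiders project and are illustrative):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'followall' is the name of one of the spiders of the project
process.crawl('followall', domain='scrapinghub.com')
process.start()  # the script will block here until the crawling is finished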
There's another Scrapy utility that provides more control over the crawling process:
scrapy.crawler.CrawlerRunner. This class is a thin wrapper that encapsulates some simple helpers
to run multiple crawlers, but it won't start or interfere with existing reactors in any way.
Using this class, the reactor should be explicitly run after scheduling your spiders. It's recommended you use
CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run
Scrapy in the same reactor.
Note that you will also have to shut down the Twisted reactor yourself after the spider is finished. This can be achieved
by adding callbacks to the deferred returned by the CrawlerRunner.crawl method.
Here's an example of its usage, along with a callback to manually stop the reactor after MySpider has finished running.
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
class MySpider(scrapy.Spider):
    # Your spider definition
    ...
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
See also:
Twisted Reactor Overview.
process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
Same example but running the spiders sequentially by chaining the deferreds:
from twisted.internet import reactor, defer

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()
crawl()
reactor.run() # the script will block here until the last crawl call is finished
See also:
Run Scrapy from a script.
Then you fire a spider run on 3 different Scrapyd servers. The spider would receive a spider argument part with
the number of the partition to crawl:
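For example, the runs could be fired through Scrapyd's schedule.json endpoint; a sketch where the hostnames, project and spider names are illustrative:

curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3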
they crawl many domains (often, unbounded) instead of a specific set of sites
they don't necessarily crawl domains to completion, because it would be impractical (or impossible) to do so, and
instead limit the crawl by time or number of pages crawled
they are simpler in logic (as opposed to very complex spiders with many extraction rules) because data is often
post-processed in a separate stage
they crawl many domains concurrently, which allows them to achieve faster crawl speeds by not being limited
by any particular site constraint (each site is crawled slowly to respect politeness, but many sites are crawled in
parallel)
As said above, Scrapy's default settings are optimized for focused crawls, not broad crawls. However, due to its asynchronous architecture, Scrapy is very well suited for performing fast broad crawls. This page summarizes some things
you need to keep in mind when using Scrapy for doing broad crawls, along with concrete suggestions of Scrapy
settings to tune in order to achieve an efficient broad crawl.
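As a rough sketch, a broad-crawl oriented settings module might combine several of the adjustments discussed in this section (the values are illustrative and should be tuned to your hardware and politeness requirements):

# settings.py (illustrative values for a broad crawl)
CONCURRENT_REQUESTS = 100          # increase global concurrency
REACTOR_THREADPOOL_MAXSIZE = 20    # bigger thread pool for DNS resolution
LOG_LEVEL = 'INFO'                 # reduce logging overhead
COOKIES_ENABLED = False            # most broad crawls don't need cookies
RETRY_ENABLED = False              # disable retries
DOWNLOAD_TIMEOUT = 15              # reduce download timeout
REDIRECT_ENABLED = False           # disable redirects
AJAXCRAWL_ENABLED = True           # enable crawling of 'AJAX crawlable' pages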
When doing broad crawls it's common to crawl a lot of 'index' web pages; AjaxCrawlMiddleware helps to crawl
them correctly. It is turned OFF by default because it has some performance overhead, and enabling it for focused
crawls doesn't make much sense.
Firecookie
Firecookie makes it easier to view and manage cookies. You can use this extension to create a new cookie, delete
existing cookies, see a list of cookies for the current site, manage cookies permissions and a lot more.
5.7.1 Introduction
This document explains how to use Firebug (a Firefox add-on) to make the scraping process easier and more fun.
For other useful Firefox add-ons see Useful Firefox add-ons for scraping. There are some caveats with using Firefox
add-ons to inspect pages, see Caveats with inspecting the live browser DOM.
In this example, well show how to use Firebug to scrape data from the Google Directory, which contains the same
data as the Open Directory Project used in the tutorial but with a different face.
Firebug comes with a very useful feature called Inspect Element which allows you to inspect the HTML code of the
different page elements just by hovering your mouse over them. Otherwise you would have to search for the tags
manually through the HTML body which can be a very tedious task.
In the following screenshot you can see the Inspect Element tool in action.
At first sight, we can see that the directory is divided into categories, which are in turn divided into subcategories.
However, it seems that there are more subcategories than the ones shown on this page, so we'll keep looking:
As expected, the subcategories contain links to other subcategories, and also links to actual websites, which is the
purpose of the directory.
So, based on that regular expression we can create the first crawling rule:
Rule(LinkExtractor(allow='directory.google.com/[A-Z][a-zA-Z_/]+$'),
     'parse_category',
     follow=True,
),
The Rule object instructs CrawlSpider based spiders how to follow the category links. parse_category will
be a method of the spider which will process and extract data from those pages.
This is how the spider would look so far:
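A sketch of such a spider, assuming a CrawlSpider for the directory site (the names are illustrative):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class GoogleDirectorySpider(CrawlSpider):
    name = 'directory.google.com'
    allowed_domains = ['directory.google.com']
    start_urls = ['http://directory.google.com/']

    rules = (
        Rule(LinkExtractor(allow='directory.google.com/[A-Z][a-zA-Z_/]+$'),
             'parse_category', follow=True,
        ),
    )

    def parse_category(self, response):
        # write the category page data extraction code here
        pass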
As you can see, the page markup is not very descriptive: the elements don't contain id, class or any attribute that
clearly identifies them, so we'll use the ranking bars as a reference point to select the data to extract when we construct
our XPaths.
After using FireBug, we can see that each link is inside a td tag, which is itself inside a tr tag that also contains the
link's ranking bar (in another td).
So we can select the ranking bar, then find its parent (the tr), and then finally, the link's td (which contains the data
we want to scrape).
This results in the following XPath:
//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td//a
It's important to use the Scrapy shell to test these complex XPath expressions and make sure they work as expected.
Basically, that expression will look for the ranking bar's td element (a td which has a descendant a element whose
href attribute contains the string #pagerank), and then select the links inside its following sibling td elements.
Of course, this is not the only XPath, and maybe not the simplest one for selecting that data. Another approach could be,
for example, to find any font tags that have the grey colour of the links.
Finally, we can write our parse_category() method:
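A sketch of the method, built around the XPath above (DirectoryItem is a hypothetical Item with these fields):

def parse_category(self, response):
    # the links to websites live in the td that follows the ranking bar's td
    links = response.xpath('//td[descendant::a[contains(@href, "#pagerank")]]'
                           '/following-sibling::td//a')
    for link in links:
        item = DirectoryItem()
        item['name'] = link.xpath('text()').extract()
        item['url'] = link.xpath('@href').extract()
        yield item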
Be aware that you may find some elements which appear in Firebug but not in the original HTML, such as the typical
case of <tbody> elements. This happens because Firebug inspects the live DOM rather than the HTML source the page
was served with.
You can enter the telnet console and inspect how many objects (of the classes mentioned above) are currently alive
using the prefs() function which is an alias to the print_live_refs() function:
telnet localhost 6023
>>> prefs()
Live References

ExampleSpider                       1   oldest: 15s ago
HtmlResponse                       10   oldest: 1s ago
Selector                            2   oldest: 0s ago
FormRequest                       878   oldest: 7s ago
As you can see, that report also shows the age of the oldest object in each class. If you're running multiple spiders
per process chances are you can figure out which spider is leaking by looking at the oldest request or response. You
can get the oldest object of each class using the get_oldest() function (from the telnet console).
Which objects are tracked?
The objects tracked by trackref are all from these classes (and all their subclasses):
scrapy.http.Request
scrapy.http.Response
scrapy.item.Item
scrapy.selector.Selector
scrapy.spiders.Spider
A real example
Let's see a concrete example of a hypothetical case of memory leaks. Suppose we have some spider with a line
similar to this one:
return Request("http://www.somenastyspider.com/product.php?pid=%d" % product_id,
               callback=self.parse, meta={'referer': response})
That line is passing a response reference inside a request, which effectively ties the response's lifetime to the request's,
and that would definitely cause memory leaks.
Let's see how we can discover the cause (without knowing it a priori, of course) by using the trackref tool.
After the crawler is running for a few minutes and we notice its memory usage has grown a lot, we can enter its telnet
console and check the live references:
>>> prefs()
Live References

SomenastySpider                     1   oldest: 15s ago
HtmlResponse                     3890   oldest: 265s ago
Selector                            2   oldest: 0s ago
Request                          3878   oldest: 250s ago
The fact that there are so many live responses (and that they're so old) is definitely suspicious, as responses should
have a relatively short lifetime compared to Requests. The number of responses is similar to the number of requests,
so it looks like they are tied in some way. We can now go and check the code of the spider to discover the nasty line
that is generating the leaks (passing response references inside requests).
Sometimes extra information about live objects can be helpful. Let's check the oldest response:
>>> from scrapy.utils.trackref import get_oldest
>>> r = get_oldest('HtmlResponse')
>>> r.url
'http://www.somenastyspider.com/product.php?pid=123'
If you want to iterate over all objects, instead of getting the oldest one, you can use the
scrapy.utils.trackref.iter_all() function:
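For example (continuing the hypothetical spider above; the URLs are illustrative):

>>> from scrapy.utils.trackref import iter_all
>>> [r.url for r in iter_all('HtmlResponse')]
['http://www.somenastyspider.com/product.php?pid=123',
 'http://www.somenastyspider.com/product.php?pid=584',
 ...]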
scrapy.utils.trackref module
Here are the functions available in the trackref module.
class scrapy.utils.trackref.object_ref
Inherit from this class (instead of object) if you want to track live instances with the trackref module.
scrapy.utils.trackref.print_live_refs(class_name, ignore=NoneType)
Print a report of live references, grouped by class name.
Parameters ignore (class or classes tuple) if given, all objects from the specified class (or tuple
of classes) will be ignored.
scrapy.utils.trackref.get_oldest(class_name)
Return the oldest object alive with the given class name, or None if none is found. Use print_live_refs()
first to get a list of all tracked live objects per class name.
scrapy.utils.trackref.iter_all(class_name)
Return an iterator over all objects alive with the given class name, or None if none is found. Use
print_live_refs() first to get a list of all tracked live objects per class name.
The telnet console also comes with a built-in shortcut (hpy) for accessing Guppy heap objects. Here's an example to
view all Python objects available in the heap using Guppy:
>>> x = hpy.heap()
>>> x.bytype
Partition of a set of 297033 objects. Total size = 52587824 bytes.
 Index  Count   %     Size   % Cumulative  % Type
     0  22307   8 16423880  31  16423880  31 dict
     1 122285  41 12441544  24  28865424  55 str
     2  68346  23  5966696  11  34832120  66 tuple
     3    227   0  5836528  11  40668648  77 unicode
     4   2461   1  2222272   4  42890920  82 type
     5  16870   6  2024400   4  44915320  85 function
     6  13949   5  1673880   3  46589200  89 types.CodeType
     7  13422   5  1653104   3  48242304  92 list
     8   3735   1  1173680   2  49415984  94 _sre.SRE_Pattern
     9   1209   0   456936   1  49872920  95 scrapy.http.headers.Headers
<1676 more rows. Type e.g. '_.more' to view.>
You can see that most space is used by dicts. Then, if you want to see from which attribute those dicts are referenced,
you could do:
>>> x.bytype[0].byvia
Partition of a set of 22307 objects. Total size = 16423880 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0  10982  49  9416336  57   9416336  57 '.__dict__'
     1   1820   8  2681504  16  12097840  74 '.__dict__', '.func_globals'
     2   3097  14  1122904   7  13220744  80
     3    990   4   277200   2  13497944  82 "['cookies']"
     4    987   4   276360   2  13774304  84 "['cache']"
     5    985   4   275800   2  14050104  86 "['meta']"
     6    897   4   251160   2  14301264  87 '[2]'
     7      1   0   196888   1  14498152  88 "['moduleDict']", "['modules']"
     8    672   3   188160   1  14686312  89 "['cb_kwargs']"
     9     27   0   155016   1  14841328  90 '[1]'
<333 more rows. Type e.g. '_.more' to view.>
As you can see, the Guppy module is very powerful but also requires some deep knowledge about Python internals.
For more info about Guppy, refer to the Guppy documentation.
Unfortunately, this patch can only free an arena if there are no more objects allocated in it anymore. This
means that fragmentation is a large issue. An application could have many megabytes of free memory,
scattered throughout all the arenas, but it will be unable to free any of it. This is a problem experienced
by all memory allocators. The only way to solve it is to move to a compacting garbage collector, which is
able to move objects in memory. This would require significant changes to the Python interpreter.
To keep memory consumption reasonable you can split the job into several smaller jobs or enable the persistent job queue
and stop/start the spider from time to time.
The advantage of using the ImagesPipeline for image files is that you can configure some extra functions like
generating thumbnails and filtering the images based on their size.
The Images Pipeline uses Pillow for thumbnailing and normalizing images to JPEG/RGB format, so you need to install
this library in order to use it. Python Imaging Library (PIL) should also work in most cases, but it is known to cause
trouble in some setups, so we recommend using Pillow instead of PIL.
If you need something more complex and want to override the custom pipeline behaviour, see Extending the Media
Pipelines.
Note: You can also use both the Files and Images Pipeline at the same time.
Then, configure the target storage setting to a valid value that will be used for storing the downloaded images. Otherwise the pipeline will remain disabled, even if you include it in the ITEM_PIPELINES setting.
For the Files Pipeline, set the FILES_STORE setting:
FILES_STORE = '/path/to/valid/dir'
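For the Images Pipeline, the corresponding setting is IMAGES_STORE (sketch):

IMAGES_STORE = '/path/to/valid/dir'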
Where:
<IMAGES_STORE> is the directory defined in IMAGES_STORE setting for the Images Pipeline.
full is a sub-directory to separate full images from thumbnails (if used). For more info see Thumbnail generation for images.
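Thumbnail generation is enabled by setting IMAGES_THUMBS to a dictionary of size names and dimensions; a minimal sketch (the names and sizes are illustrative):

IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}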
When you use this feature, the Images Pipeline will create thumbnails of each specified size with this format:
<IMAGES_STORE>/thumbs/<size_name>/<image_id>.jpg
Where:
<size_name> is the one specified in the IMAGES_THUMBS dictionary keys (small, big, etc)
<image_id> is the SHA1 hash of the image url
Example of image files stored using small and big thumbnail names:
<IMAGES_STORE>/full/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
<IMAGES_STORE>/thumbs/small/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
<IMAGES_STORE>/thumbs/big/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
The first one is the full image, as downloaded from the site.
Filtering out small images
When using the Images Pipeline, you can drop images which are too small, by specifying the minimum allowed size
in the IMAGES_MIN_HEIGHT and IMAGES_MIN_WIDTH settings.
For example:
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
Those requests will be processed by the pipeline and, when they have finished downloading, the results
will be sent to the item_completed() method, as a list of 2-element tuples. Each tuple will contain
(success, file_info_or_error) where:
success is a boolean which is True if the image was downloaded successfully or False if it failed
for some reason
file_info_or_error is a dict containing the following keys (if success is True) or a Twisted
Failure if there was a problem.
url - the url where the file was downloaded from. This is the url of the request returned from the
get_media_requests() method.
path - the path (relative to FILES_STORE) where the file was stored
checksum - a MD5 hash of the image contents
The list of tuples received by item_completed() is guaranteed to retain the same order of the requests
returned from the get_media_requests() method.
Heres a typical value of the results argument:
[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
   'url': 'http://www.example.com/files/product1.pdf'}),
 (False,
  Failure(...))]
By default the get_media_requests() method returns None which means there are no files to download for the item.
item_completed(results, item, info)
The FilesPipeline.item_completed() method is called when all file requests for a single item
have completed (either finished downloading, or failed for some reason).
The item_completed() method must return the output that will be sent to subsequent item pipeline
stages, so you must return (or drop) the item, as you would in any pipeline.
Here is an example of the item_completed() method where we store the downloaded file paths
(passed in results) in the file_paths item field, and we drop the item if it doesn't contain any files:
from scrapy.exceptions import DropItem

def item_completed(self, results, item, info):
    file_paths = [x['path'] for ok, x in results if ok]
    if not file_paths:
        raise DropItem("Item contains no files")
    item['file_paths'] = file_paths
    return item
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
2. when a response is received, the download delay is adjusted to the average of previous download delay and the
latency of the response.
Note:
The AutoThrottle extension honours the standard Scrapy settings for concurrency and delay. This
means that it will never set a download delay lower than DOWNLOAD_DELAY or a concurrency higher than
CONCURRENT_REQUESTS_PER_DOMAIN (or CONCURRENT_REQUESTS_PER_IP, depending on which one
you use).
5.12.4 Settings
The settings used to control the AutoThrottle extension are:
AUTOTHROTTLE_ENABLED
AUTOTHROTTLE_START_DELAY
AUTOTHROTTLE_MAX_DELAY
AUTOTHROTTLE_DEBUG
CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS_PER_IP
DOWNLOAD_DELAY
For more information see Throttling algorithm.
AUTOTHROTTLE_ENABLED
Default: False
Enables the AutoThrottle extension.
AUTOTHROTTLE_START_DELAY
Default: 5.0
The initial download delay (in seconds).
AUTOTHROTTLE_MAX_DELAY
Default: 60.0
The maximum download delay (in seconds) to be set in case of high latencies.
AUTOTHROTTLE_DEBUG
Default: False
Enable AutoThrottle debug mode which will display stats on every response received, so you can see how the throttling
parameters are being adjusted in real time.
5.13 Benchmarking
New in version 0.17.
Scrapy comes with a simple benchmarking suite that spawns a local HTTP server and crawls it at the maximum
possible speed. The goal of this benchmarking is to get an idea of how Scrapy performs on your hardware, in order to
have a common baseline for comparisons. It uses a simple spider that does nothing and just follows links.
To run it use:
scrapy bench
That tells you that Scrapy is able to crawl about 3900 pages per minute on the hardware where you run it. Note that
this is a very simple spider intended only to follow links; any custom spider you write will probably do more work, which
results in slower crawl rates. How much slower depends on how much your spider does and how well it's written.
In the future, more cases will be added to the benchmarking suite to cover other common scenarios.
Then, you can stop the spider safely at any time (by pressing Ctrl-C or sending a signal), and resume it later by issuing
the same command:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
CHAPTER 6
Extending Scrapy
6.1.1 Overview
The following diagram shows an overview of the Scrapy architecture with its components and an outline of the data
flow that takes place inside the system (shown by the green arrows). A brief description of the components is included
below with links for more detailed information about them. The data flow is also described below.
6.1.2 Components
Scrapy Engine
The engine is responsible for controlling the data flow between all components of the system, and triggering events
when certain actions occur. See the Data Flow section below for more details.
Scheduler
The Scheduler receives requests from the engine and enqueues them for feeding them later (also to the engine) when
the engine requests them.
Downloader
The Downloader is responsible for fetching web pages and feeding them to the engine which, in turn, feeds them to
the spiders.
Spiders
Spiders are custom classes written by Scrapy users to parse responses and extract items (aka scraped items) from them
or additional URLs (requests) to follow. Each spider is able to handle a specific domain (or group of domains). For
more information see Spiders.
Item Pipeline
The Item Pipeline is responsible for processing the items once they have been extracted (or scraped) by the spiders.
Typical tasks include cleansing, validation and persistence (like storing the item in a database). For more information
see Item Pipeline.
Downloader middlewares
Downloader middlewares are specific hooks that sit between the Engine and the Downloader and process requests
when they pass from the Engine to the Downloader, and responses that pass from Downloader to the Engine. They
provide a convenient mechanism for extending Scrapy functionality by plugging custom code. For more information
see Downloader Middleware.
Spider middlewares
Spider middlewares are specific hooks that sit between the Engine and the Spiders and are able to process spider input
(responses) and output (items and requests). They provide a convenient mechanism for extending Scrapy functionality
by plugging custom code. For more information see Spider Middleware.
3. The Engine asks the Scheduler for the next URLs to crawl.
4. The Scheduler returns the next URLs to crawl to the Engine and the Engine sends them to the Downloader,
passing through the Downloader Middleware (request direction).
5. Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the
Engine, passing through the Downloader Middleware (response direction).
6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing
through the Spider Middleware (input direction).
7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine.
8. The Engine sends scraped items (returned by the Spider) to the Item Pipeline and Requests (returned by the Spider)
to the Scheduler.
9. The process repeats (from step 2) until there are no more requests from the Scheduler, and the Engine closes the
domain.
The DOWNLOADER_MIDDLEWARES setting is merged with the DOWNLOADER_MIDDLEWARES_BASE setting defined in Scrapy (and not meant to be overridden) and then sorted by order to get the final sorted list of enabled
middlewares: the first middleware is the one closer to the engine and the last is the one closer to the downloader.
To decide which order to assign to your middleware see the DOWNLOADER_MIDDLEWARES_BASE setting and pick a
value according to where you want to insert the middleware. The order does matter because each middleware performs
a different action and your middleware could depend on some previous (or subsequent) middleware being applied.
If you want to disable a built-in middleware (the ones defined in DOWNLOADER_MIDDLEWARES_BASE and enabled
by default) you must define it in your project's DOWNLOADER_MIDDLEWARES setting and assign None as its value.
For example, if you want to disable the user-agent middleware:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
Finally, keep in mind that some middlewares may need to be enabled through a particular setting. See each middleware
documentation for more info.
Parameters:
    request (Request object) - the request that originated the response
    response (Response object) - the response being processed
    spider (Spider object) - the spider for which this response is intended
process_exception(request, exception, spider)
Scrapy calls process_exception() when a download handler or a process_request() (from a
downloader middleware) raises an exception (including an IgnoreRequest exception)
process_exception() should return: either None, a Response object, or a Request object.
If it returns None, Scrapy will continue processing this exception, executing any other
process_exception() methods of installed middleware, until no middleware is left and the default
exception handling kicks in.
If it returns a Response object, the process_response() method chain of installed middleware is
started, and Scrapy won't bother calling any other process_exception() methods of middleware.
If it returns a Request object, the returned request is rescheduled to be downloaded in the future. This
stops the execution of process_exception() methods of the middleware the same as returning a
response would.
Parameters:
    request (Request object) - the request that generated the exception
    exception (an Exception object) - the raised exception
    spider (Spider object) - the spider for which this request is intended
For example:
for i, url in enumerate(urls):
    yield scrapy.Request("http://www.example.com", meta={'cookiejar': i},
                         callback=self.parse_page)
Keep in mind that the cookiejar meta key is not sticky. You need to keep passing it along on subsequent requests.
For example:
def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)
COOKIES_ENABLED
Default: True
Whether to enable the cookies middleware. If disabled, no cookies will be sent to web servers.
COOKIES_DEBUG
Default: False
If enabled, Scrapy will log all cookies sent in requests (ie. Cookie header) and all cookies received in responses (ie.
Set-Cookie header).
Here's an example of a log with COOKIES_DEBUG enabled:
DefaultHeadersMiddleware
class scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware
This middleware sets all default requests headers specified in the DEFAULT_REQUEST_HEADERS setting.
DownloadTimeoutMiddleware
class scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware
This middleware sets the download timeout for requests specified in the DOWNLOAD_TIMEOUT setting or
download_timeout spider attribute.
Note: You can also set download timeout per-request using download_timeout Request.meta key; this is supported even when DownloadTimeoutMiddleware is disabled.
HttpAuthMiddleware
class scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware
This middleware authenticates all requests generated from certain spiders using Basic access authentication
(aka. HTTP auth).
To enable HTTP authentication from certain spiders, set the http_user and http_pass attributes of those
spiders.
Example:
from scrapy.spiders import CrawlSpider

class SomeIntranetSiteSpider(CrawlSpider):

    http_user = 'someuser'
    http_pass = 'somepass'
    name = 'intranet.example.com'

    # .. rest of the spider code omitted ...
HttpCacheMiddleware
class scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware
This middleware provides low-level cache to all HTTP requests and responses. It has to be combined with a
cache storage backend as well as a cache policy.
Scrapy ships with two HTTP cache storage backends:
Filesystem storage backend (default)
DBM storage backend
You can change the HTTP cache storage backend with the HTTPCACHE_STORAGE setting. Or you can also
implement your own storage backend.
Scrapy ships with two HTTP cache policies:
RFC2616 policy
Dummy policy (default)
You can change the HTTP cache policy with the HTTPCACHE_POLICY setting. Or you can also implement
your own policy. You can also avoid caching a response on every policy using dont_cache meta key equals
True.
Dummy policy (default)
This policy has no awareness of any HTTP Cache-Control directives. Every request and its corresponding response are
cached. When the same request is seen again, the response is returned without transferring anything from the Internet.
The Dummy policy is useful for testing spiders faster (without having to wait for downloads every time) and for trying
your spider offline, when an Internet connection is not available. The goal is to be able to replay a spider run exactly
as it ran before.
In order to use this policy, set:
HTTPCACHE_POLICY to scrapy.extensions.httpcache.DummyPolicy
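In a project's settings.py this could look like the following minimal sketch (HTTPCACHE_ENABLED turns the cache middleware on):

HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'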
RFC2616 policy
This policy provides a RFC2616 compliant HTTP cache, i.e. with HTTP Cache-Control awareness, aimed at production and used in continuous runs to avoid downloading unmodified data (to save bandwidth and speed up crawls).
what is implemented:
Do not attempt to store responses/requests with no-store cache-control directive set
Do not serve responses from cache if no-cache cache-control directive is set even for fresh responses
Compute freshness lifetime from max-age cache-control directive
Compute freshness lifetime from Expires response header
Compute freshness lifetime from Last-Modified response header (heuristic used by Firefox)
Compute current age from Age response header
Compute current age from Date header
Revalidate stale responses based on Last-Modified response header
Revalidate stale responses based on ETag response header
Set Date header for any received response missing it
what is missing:
Pragma: no-cache support http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9.1
Vary header support http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.6
Invalidation after updates or deletes http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.10
... probably others ..
In order to use this policy, set:
HTTPCACHE_POLICY to scrapy.extensions.httpcache.RFC2616Policy
Filesystem storage backend (default)
File system storage backend is available for the HTTP cache middleware.
In order to use this storage backend, set:
HTTPCACHE_STORAGE to scrapy.extensions.httpcache.FilesystemCacheStorage
Each request/response pair is stored in a different directory containing the following files:
request_body - the plain request body
request_headers - the request headers (in raw HTTP format)
response_body - the plain response body
response_headers - the response headers (in raw HTTP format)
meta - some metadata of this cache resource in Python repr() format (grep-friendly format)
pickled_meta - the same metadata in meta but pickled for more efficient deserialization
The directory name is made from the request fingerprint (see scrapy.utils.request.fingerprint), and
one level of subdirectories is used to avoid creating too many files into the same directory (which is inefficient in many
file systems). An example directory could be:
/path/to/cache/dir/example.com/72/72811f648e718090f041317756c03adb0ada46c7
HttpProxyMiddleware
New in version 0.8.
class scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware
This middleware sets the HTTP proxy to use for requests, by setting the proxy meta value for Request
objects.
Like the Python standard library modules urllib and urllib2, it obeys the following environment variables:
http_proxy
https_proxy
no_proxy
You can also set the meta key proxy per-request, to a value like http://some_proxy_server:port.
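A sketch of routing a single request through a proxy from a spider callback (the proxy address is illustrative):

yield scrapy.Request('http://www.example.com',
                     meta={'proxy': 'http://some_proxy_server:8080'})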
RedirectMiddleware
class scrapy.downloadermiddlewares.redirect.RedirectMiddleware
This middleware handles redirection of requests based on response status.
The urls which the request goes through (while being redirected) can be found in the redirect_urls
Request.meta key.
The RedirectMiddleware can be configured through the following settings (see the settings documentation for
more info):
REDIRECT_ENABLED
REDIRECT_MAX_TIMES
If Request.meta has dont_redirect key set to True, the request will be ignored by this middleware.
RedirectMiddleware settings
This middleware obeys the REDIRECT_MAX_TIMES setting, and the dont_redirect and redirect_urls request meta
keys, as described for RedirectMiddleware.
MetaRefreshMiddleware settings
RobotsTxtMiddleware
class scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware
This middleware filters out requests forbidden by the robots.txt exclusion standard.
To make sure Scrapy respects robots.txt make sure the middleware is enabled and the ROBOTSTXT_OBEY
setting is enabled.
Warning: Keep in mind that, if you crawl using multiple concurrent requests per domain, Scrapy could still
download some forbidden pages if they were requested before the robots.txt file was downloaded. This is a
known limitation of the current robots.txt middleware and will be fixed in the future.
If Request.meta has dont_obey_robotstxt key set to True the request will be ignored by this middleware
even if ROBOTSTXT_OBEY is enabled.
DownloaderStats
class scrapy.downloadermiddlewares.stats.DownloaderStats
Middleware that stores stats of all requests, responses and exceptions that pass through it.
To use this middleware you must enable the DOWNLOADER_STATS setting.
UserAgentMiddleware
class scrapy.downloadermiddlewares.useragent.UserAgentMiddleware
Middleware that allows spiders to override the default user agent.
In order for a spider to override the default user agent, its user_agent attribute must be set.
AjaxCrawlMiddleware
class scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware
Middleware that finds 'AJAX crawlable' page variants based on the meta-fragment HTML tag. See
https://developers.google.com/webmasters/ajax-crawling/docs/getting-started for more info.
Note: Scrapy finds 'AJAX crawlable' pages for URLs like http://example.com/!#foo=bar even
without this middleware. AjaxCrawlMiddleware is necessary when the URL doesn't contain !#. This is often the
case for 'index' or 'main' website pages.
AjaxCrawlMiddleware Settings
The SPIDER_MIDDLEWARES setting is merged with the SPIDER_MIDDLEWARES_BASE setting defined in Scrapy
(and not meant to be overridden) and then sorted by order to get the final sorted list of enabled middlewares: the first
middleware is the one closer to the engine and the last is the one closer to the spider.
To decide which order to assign to your middleware see the SPIDER_MIDDLEWARES_BASE setting and pick a value
according to where you want to insert the middleware. The order does matter because each middleware performs a
different action and your middleware could depend on some previous (or subsequent) middleware being applied.
If you want to disable a builtin middleware (the ones defined in SPIDER_MIDDLEWARES_BASE, and enabled by default) you must define it in your project's SPIDER_MIDDLEWARES setting and assign None as its value. For example,
if you want to disable the off-site middleware:
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.CustomSpiderMiddleware': 543,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}
Finally, keep in mind that some middlewares may need to be enabled through a particular setting. See each middleware
documentation for more info.
Parameters
response (Response object) the response being processed
spider (Spider object) the spider for which this response is intended
process_spider_output(response, result, spider)
This method is called with the results returned from the Spider, after it has processed the response.
process_spider_output() must return an iterable of Request, dict or Item objects.
Parameters
response (Response object) the response which generated this output from the spider
result (an iterable of Request, dict or Item objects) the result returned by the
spider
spider (Spider object) the spider whose result is being processed
process_spider_exception(response, exception, spider)
This method is called when a spider or process_spider_input() method (from another spider
middleware) raises an exception.
process_spider_exception() should return either None or an iterable of Response, dict or
Item objects.
If it returns None, Scrapy will continue processing this exception, executing any other
process_spider_exception() in the following middleware components, until no middleware
components are left and the exception reaches the engine (where it's logged and discarded).
If it returns an iterable the process_spider_output() pipeline kicks in, and no other
process_spider_exception() will be called.
Parameters
response (Response object) the response being processed when the exception was
raised
exception (Exception object) the exception raised
spider (Spider object) the spider which raised the exception
process_start_requests(start_requests, spider)
New in version 0.15.
This method is called with the start requests of the spider, and works similarly to the
process_spider_output() method, except that it doesn't have a response associated and must
return only requests (not items).
It receives an iterable (in the start_requests parameter) and must return another iterable of
Request objects.
Note: When implementing this method in your spider middleware, you should always return an iterable
(that follows the input one) and not consume the entire start_requests iterator, because it can be very large
(or even unbounded) and cause a memory overflow. The Scrapy engine is designed to pull start requests
while it has capacity to process them, so the start requests iterator can be effectively endless where there
is some other condition for stopping the spider (like a time limit or item/page count).
Parameters
start_requests (an iterable of Request) the start requests
spider (Spider object) the spider to whom the start requests belong
The handle_httpstatus_list key of Request.meta can also be used to specify which response codes to
allow on a per-request basis. You can also set the meta key handle_httpstatus_all to True if you want to
allow any response code for a request.
Keep in mind, however, that it's usually a bad idea to handle non-200 responses, unless you really know what you're
doing.
For more information see: HTTP Status Code Definitions.
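A sketch of allowing a 404 response to reach the callback for one request only (the URL is illustrative):

yield scrapy.Request('http://www.example.com/missing-page',
                     meta={'handle_httpstatus_list': [404]})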
HttpErrorMiddleware settings
HTTPERROR_ALLOWED_CODES Default: []
Pass all responses with non-200 status codes contained in this list.
To avoid filling the log with too much noise, it will only print one of these messages for each new domain
filtered. So, for example, if another request for www.othersite.com is filtered, no log message will be
printed. But if a request for someothersite.com is filtered, a message will be printed (but only for the first
request filtered).
If the spider doesn't define an allowed_domains attribute, or the attribute is empty, the offsite middleware
will allow all requests.
If the request has the dont_filter attribute set, the offsite middleware will allow the request even if its
domain is not listed in allowed domains.
RefererMiddleware
class scrapy.spidermiddlewares.referer.RefererMiddleware
Populates Request Referer header, based on the URL of the Response which generated it.
RefererMiddleware settings
6.4 Extensions
The extensions framework provides a mechanism for inserting your own custom functionality into Scrapy.
Extensions are just regular classes that are instantiated at Scrapy startup, when extensions are initialized.
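A sketch of the EXTENSIONS setting in a project's settings.py (the listed extensions are just examples):

EXTENSIONS = {
    'scrapy.extensions.corestats.CoreStats': 500,
    'scrapy.telnet.TelnetConsole': 500,
}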
As you can see, the EXTENSIONS setting is a dict where the keys are the extension paths, and their values are the
orders, which define the extension loading order. Extension orders are not as important as middleware orders though,
and they are typically irrelevant, i.e. it doesn't matter in which order the extensions are loaded because they don't
depend on each other [1].
However, this feature can be exploited if you need to add an extension which depends on other extensions already
loaded.
[1] This is why the EXTENSIONS_BASE setting in Scrapy (which contains all built-in extensions enabled by
default) defines all the extensions with the same order (500).
class scrapy.extensions.logstats.LogStats
Log basic stats like crawled pages and scraped items.
Core Stats extension
class scrapy.extensions.corestats.CoreStats
Enable the collection of core statistics, provided the stats collection is enabled (see Stats Collection).
Telnet console extension
class scrapy.telnet.TelnetConsole
Provides a telnet console for getting into a Python interpreter inside the currently running Scrapy process, which can
be very useful for debugging.
The telnet console must be enabled by the TELNETCONSOLE_ENABLED setting, and the server will listen on the port
specified in TELNETCONSOLE_PORT.
Memory usage extension
class scrapy.extensions.memusage.MemoryUsage
Note: This extension does not work in Windows.
Monitors the memory used by the Scrapy process that runs the spider and:
1. sends a notification e-mail when it exceeds a certain value
2. closes the spider when it exceeds a certain value
The notification e-mails can be triggered when a certain warning value is reached (MEMUSAGE_WARNING_MB) and
when the maximum value is reached (MEMUSAGE_LIMIT_MB) which will also cause the spider to be closed and the
Scrapy process to be terminated.
This extension is enabled by the MEMUSAGE_ENABLED setting and can be configured with the following settings:
164
MEMUSAGE_LIMIT_MB
MEMUSAGE_WARNING_MB
MEMUSAGE_NOTIFY_MAIL
MEMUSAGE_REPORT
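A sketch of how these settings might be combined in a project's settings.py (the values and e-mail address are illustrative):
MEMUSAGE_ENABLED = True
MEMUSAGE_WARNING_MB = 1536             # e-mail a warning above 1.5 GB
MEMUSAGE_LIMIT_MB = 2048               # close the spider and stop the process above 2 GB
MEMUSAGE_NOTIFY_MAIL = ['ops@example.com']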
Memory debugger extension
class scrapy.extensions.memdebug.MemoryDebugger
An extension for debugging memory usage. It collects information about:
objects uncollected by the Python garbage collector
objects left alive that shouldn't be. For more info, see Debugging memory leaks with trackref
To enable this extension, turn on the MEMDEBUG_ENABLED setting. The info will be stored in the stats.
Close spider extension
class scrapy.extensions.closespider.CloseSpider
Closes a spider automatically when some conditions are met, using a specific closing reason for each condition.
The conditions for closing a spider can be configured through the following settings:
CLOSESPIDER_TIMEOUT
CLOSESPIDER_ITEMCOUNT
CLOSESPIDER_PAGECOUNT
CLOSESPIDER_ERRORCOUNT
CLOSESPIDER_TIMEOUT Default: 0
An integer which specifies a number of seconds. If the spider remains open for more than that number of seconds, it will be automatically closed with the reason closespider_timeout. If zero (or not set), spiders won't be closed by timeout.
CLOSESPIDER_ITEMCOUNT Default: 0
An integer which specifies a number of items. If the spider scrapes more than that number of items and those items are passed by the item pipeline, the spider will be closed with the reason closespider_itemcount. If zero (or not set), spiders won't be closed by number of passed items.
CLOSESPIDER_PAGECOUNT New in version 0.11.
Default: 0
An integer which specifies the maximum number of responses to crawl. If the spider crawls more than that, the spider will be closed with the reason closespider_pagecount. If zero (or not set), spiders won't be closed by number of crawled responses.
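For example, a project that should stop after one hour of crawling or after 10,000 scraped items (illustrative values) could set:
CLOSESPIDER_TIMEOUT = 3600       # close after one hour
CLOSESPIDER_ITEMCOUNT = 10000    # or after 10,000 items have passed the item pipeline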
StatsMailer extension
class scrapy.extensions.statsmailer.StatsMailer
This simple extension can be used to send a notification e-mail every time a domain has finished scraping, including
the Scrapy stats collected. The email will be sent to all recipients specified in the STATSMAILER_RCPTS setting.
Debugging extensions
Stack trace dump extension
class scrapy.extensions.debug.StackTraceDump
Dumps information about the running process when a SIGQUIT or SIGUSR2 signal is received. The information
dumped is the following:
1. engine status (using scrapy.utils.engine.get_engine_status())
2. live references (see Debugging memory leaks with trackref )
3. stack trace of all threads
After the stack trace and engine status is dumped, the Scrapy process continues running normally.
This extension only works on POSIX-compliant platforms (i.e. not Windows), because the SIGQUIT and SIGUSR2
signals are not available on Windows.
There are at least two ways to send Scrapy the SIGQUIT signal:
1. By pressing Ctrl-\ while a Scrapy process is running (Linux only?)
2. By running this command (assuming <pid> is the process id of the Scrapy process):
kill -QUIT <pid>
Debugger extension
class scrapy.extensions.debug.Debugger
Invokes a Python debugger inside a running Scrapy process when a SIGUSR2 signal is received. After the debugger
is exited, the Scrapy process continues running normally.
For more info see Debugging in Python.
This extension only works on POSIX-compliant platforms (i.e. not Windows).
spider
Spider currently being crawled. This is an instance of the spider class provided while constructing the
crawler, and it is created after the arguments given in the crawl() method.
crawl(*args, **kwargs)
Starts the crawler by instantiating its spider class with the given args and kwargs arguments, while setting
the execution engine in motion.
Returns a deferred that is fired when the crawl is finished.
class scrapy.crawler.CrawlerRunner(settings=None)
This is a convenient helper class that keeps track of, manages and runs crawlers inside an already set up Twisted reactor.
The CrawlerRunner object must be instantiated with a Settings object.
This class shouldn't be needed (since Scrapy is responsible for using it accordingly) unless writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.
crawl(crawler_or_spidercls, *args, **kwargs)
Run a crawler with the provided arguments.
It will call the given Crawler's crawl() method, while keeping track of it so it can be stopped later.
If crawler_or_spidercls isn't a Crawler instance, this method will try to create one using this parameter as the spider class given to it.
Returns a deferred that is fired when the crawling is finished.
Parameters
crawler_or_spidercls (Crawler instance, Spider subclass or string) already created crawler, or a spider class or spider's name inside the project to create it
args (list) arguments to initialize the spider
kwargs (dict) keyword arguments to initialize the spider
crawlers
Set of crawlers started by crawl() and managed by this class.
join()
Returns a deferred that is fired when all managed crawlers have completed their executions.
stop()
Simultaneously stops all the crawling jobs taking place.
Returns a deferred that is fired when they all have ended.
class scrapy.crawler.CrawlerProcess(settings=None)
Bases: scrapy.crawler.CrawlerRunner
A class to run multiple scrapy crawlers in a process simultaneously.
This class extends CrawlerRunner by adding support for starting a Twisted reactor and handling shutdown
signals, like the keyboard interrupt command Ctrl-C. It also configures top-level logging.
This utility should be a better fit than CrawlerRunner if you aren't running another Twisted reactor within your application.
The CrawlerProcess object must be instantiated with a Settings object.
This class shouldn't be needed (since Scrapy is responsible for using it accordingly) unless writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.
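A minimal sketch of running a crawl from a script with CrawlerProcess, assuming a trivial spider defined inline (the spider, URL and setting values are illustrative):
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {'url': response.url}

settings = Settings()
settings.set('USER_AGENT', 'example-bot (+http://example.com)')

process = CrawlerProcess(settings)
process.crawl(ExampleSpider)  # a Crawler, a Spider subclass or a spider name is accepted
process.start()               # blocks here until the crawl is finished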
Additional values can be passed on initialization with the values argument, and they would take the priority level. If the latter argument is a string, the priority name will be looked up in SETTINGS_PRIORITIES. Otherwise, a specific integer should be provided.
Once the object is created, new settings can be loaded or updated with the set() method, and can be accessed with the square bracket notation of dictionaries, or with the get() method of the instance and its value conversion variants. When requesting a stored key, the value with the highest priority will be retrieved.
set(name, value, priority='project')
Store a key/value attribute with a given priority.
Settings should be populated before configuring the Crawler object (through the configure() method), otherwise they won't have any effect.
Parameters
name (string) the setting name
value (any) the value to associate with the setting
priority (string or int) the priority of the setting. Should be a key of SETTINGS_PRIORITIES or an integer
setdict(values, priority='project')
Store key/value pairs with a given priority.
This is a helper function that calls set() for every item of values with the provided priority.
Parameters
values (dict) the settings names and values
priority (string or int) the priority of the settings. Should be a key of SETTINGS_PRIORITIES or an integer
setmodule(module, priority='project')
Store settings from a module with a given priority.
This is a helper function that calls set() for every globally declared uppercase variable of module with
the provided priority.
Parameters
module (module object or string) the module or the path of the module
priority (string or int) the priority of the settings. Should be a key of SETTINGS_PRIORITIES or an integer
get(name, default=None)
Get a setting value without affecting its original type.
Parameters
name (string) the setting name
default (any) the value to return if no setting is found
getbool(name, default=False)
Get a setting value as a boolean. For example, both 1 and '1', and True return True, while 0, '0', False and None return False.
For example, settings populated through environment variables set to '0' will return False when using this method.
Parameters
name (string) the setting name
default (any) the value to return if no setting is found
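A short sketch of populating and reading a Settings object (the setting values and priorities shown are illustrative):
from scrapy.settings import Settings

settings = Settings()
settings.set('RETRY_ENABLED', '0', priority='cmdline')   # e.g. a value coming from the command line
settings.setdict({'BOT_NAME': 'example-bot', 'DOWNLOAD_DELAY': 2}, priority='project')

settings.get('BOT_NAME')           # 'example-bot'
settings['DOWNLOAD_DELAY']         # 2, dictionary-style access
settings.getbool('RETRY_ENABLED')  # False - '0' is converted to a boolean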
disconnect_all(signal)
Disconnect all receivers from the given signal.
Parameters signal (object) the signal to disconnect from
6.6 Signals
Scrapy uses signals extensively to notify when certain events occur. You can catch some of those signals in your
Scrapy project (using an extension, for example) to perform additional tasks or extend Scrapy to add functionality not
provided out of the box.
Even though signals provide several arguments, the handlers that catch them don't need to accept all of them - the signal dispatching mechanism will only deliver the arguments that the handler receives.
You can connect to signals (or send your own) through the Signals API.
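For instance, a minimal extension (the class name is illustrative) that logs a message when each spider closes could connect to the spider_closed signal like this, and would be enabled through the EXTENSIONS setting:
from scrapy import signals

class SpiderClosedLogger(object):

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        spider.logger.info('Spider %s closed (%s)', spider.name, reason)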
engine_stopped
scrapy.signals.engine_stopped()
Sent when the Scrapy engine is stopped (for example, when a crawling process has finished).
This signal supports returning deferreds from their handlers.
item_scraped
scrapy.signals.item_scraped(item, response, spider)
Sent when an item has been scraped, after it has passed all the Item Pipeline stages (without being dropped).
This signal supports returning deferreds from their handlers.
Parameters
item (dict or Item object) the item scraped
spider (Spider object) the spider which scraped the item
response (Response object) the response from where the item was scraped
item_dropped
scrapy.signals.item_dropped(item, response, exception, spider)
Sent after an item has been dropped from the Item Pipeline when some stage raised a DropItem exception.
This signal supports returning deferreds from their handlers.
Parameters
item (dict or Item object) the item dropped from the Item Pipeline
spider (Spider object) the spider which scraped the item
response (Response object) the response from where the item was dropped
exception (DropItem exception) the exception (which must be a DropItem subclass) which caused the item to be dropped
spider_closed
scrapy.signals.spider_closed(spider, reason)
Sent after a spider has been closed. This can be used to release per-spider resources reserved on
spider_opened.
This signal supports returning deferreds from their handlers.
Parameters
spider (Spider object) the spider which has been closed
reason (str) a string which describes the reason why the spider was closed. If it was closed because the spider has completed scraping, the reason is 'finished'. Otherwise, if the spider was manually closed by calling the close_spider engine method, then the reason is the one passed in the reason argument of that method (which defaults to 'cancelled'). If the engine was shut down (for example, by hitting Ctrl-C to stop it) the reason will be 'shutdown'.
spider_opened
scrapy.signals.spider_opened(spider)
Sent after a spider has been opened for crawling. This is typically used to reserve per-spider resources, but can
be used for any task that needs to be performed when a spider is opened.
This signal supports returning deferreds from their handlers.
Parameters spider (Spider object) the spider which has been opened
spider_idle
scrapy.signals.spider_idle(spider)
Sent when a spider has gone idle, which means the spider has no further:
requests waiting to be downloaded
requests scheduled
items being processed in the item pipeline
If the idle state persists after all handlers of this signal have finished, the engine starts closing the spider. After
the spider has finished closing, the spider_closed signal is sent.
You can, for example, schedule some requests in your spider_idle handler to prevent the spider from being
closed.
This signal does not support returning deferreds from their handlers.
Parameters spider (Spider object) the spider which has gone idle
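A hedged sketch of a spider that re-schedules a request from its spider_idle handler to keep itself alive; it relies on the DontCloseSpider exception (see Exceptions) and on the engine's crawl() method, and the spider name and URL are illustrative:
import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class PollingSpider(scrapy.Spider):
    name = 'polling'
    start_urls = ['http://example.com/queue']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(PollingSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.handle_idle, signal=signals.spider_idle)
        return spider

    def parse(self, response):
        yield {'url': response.url}

    def handle_idle(self, spider):
        # Re-queue the start URL and prevent the spider from closing (this runs forever).
        self.crawler.engine.crawl(
            scrapy.Request(self.start_urls[0], dont_filter=True), spider)
        raise DontCloseSpider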
spider_error
scrapy.signals.spider_error(failure, response, spider)
Sent when a spider callback generates an error (i.e. raises an exception).
This signal does not support returning deferreds from their handlers.
Parameters
failure (Failure object) the exception raised as a Twisted Failure object
response (Response object) the response being processed when the exception was
raised
spider (Spider object) the spider which raised the exception
request_scheduled
scrapy.signals.request_scheduled(request, spider)
Sent when the engine schedules a Request, to be downloaded later.
The signal does not support returning deferreds from their handlers.
Parameters
request (Request object) the request that reached the scheduler
spider (Spider object) the spider that yielded the request
request_dropped
scrapy.signals.request_dropped(request, spider)
Sent when a Request, scheduled by the engine to be downloaded later, is rejected by the scheduler.
The signal does not support returning deferreds from their handlers.
Parameters
request (Request object) the request that reached the scheduler
spider (Spider object) the spider that yielded the request
response_received
scrapy.signals.response_received(response, request, spider)
Sent when the engine receives a new Response from the downloader.
This signal does not support returning deferreds from their handlers.
Parameters
response (Response object) the response received
request (Request object) the request that generated the response
spider (Spider object) the spider for which the response is intended
response_downloaded
scrapy.signals.response_downloaded(response, request, spider)
Sent by the downloader right after a HTTPResponse is downloaded.
This signal does not support returning deferreds from their handlers.
Parameters
response (Response object) the response downloaded
request (Request object) the request that generated the response
spider (Spider object) the spider for which the response is intended
BaseItemExporter
class scrapy.exporters.BaseItemExporter(fields_to_export=None, export_empty_fields=False, encoding='utf-8')
This is the (abstract) base class for all Item Exporters. It provides support for common features used by all
(concrete) Item Exporters, such as defining what fields to export, whether to export empty fields, or which
encoding to use.
These features can be configured through the constructor arguments which populate their respective instance
attributes: fields_to_export, export_empty_fields, encoding.
export_item(item)
Exports the given item. This method must be implemented in subclasses.
serialize_field(field, name, value)
Return the serialized value for the given field. You can override this method (in your custom Item Exporters) if you want to control how a particular field or value will be serialized/exported.
By default, this method looks for a serializer declared in the item field and returns the result of applying
that serializer to the value. If no serializer is found, it returns the value unchanged except for unicode
values which are encoded to str using the encoding declared in the encoding attribute.
Parameters
field (Field object or an empty dict) the field being serialized. If a raw dict is being
exported (not Item) field value is an empty dict.
name (str) the name of the field being serialized
value the value being serialized
start_exporting()
Signal the beginning of the exporting process. Some exporters may use this to generate some required
header (for example, the XmlItemExporter). You must call this method before exporting any items.
finish_exporting()
Signal the end of the exporting process. Some exporters may use this to generate some required footer (for
example, the XmlItemExporter). You must always call this method after you have no more items to
export.
fields_to_export
A list with the name of the fields that will be exported, or None if you want to export all fields. Defaults to
None.
Some exporters (like CsvItemExporter) respect the order of the fields defined in this attribute.
Some exporters may require fields_to_export list in order to export the data properly when spiders return
dicts (not Item instances).
export_empty_fields
Whether to include empty/unpopulated item fields in the exported data. Defaults to False. Some exporters (like CsvItemExporter) ignore this attribute and always export all empty fields.
This option is ignored for dict items.
encoding
The encoding that will be used to encode unicode values. This only affects unicode values (which are
always serialized to str using this encoding). Other value types are passed unchanged to the specific
serialization library.
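As an illustration of the interface described above, here is a sketch of a concrete exporter subclass that overrides serialize_field() to coerce one field; the class, field and file names are illustrative:
from scrapy.exporters import JsonLinesItemExporter

class PriceAsFloatExporter(JsonLinesItemExporter):

    def serialize_field(self, field, name, value):
        # Coerce the 'price' field to a float; defer everything else to the base class.
        if name == 'price':
            value = float(value)
        return super(PriceAsFloatExporter, self).serialize_field(field, name, value)

exporter = PriceAsFloatExporter(open('items.jl', 'wb'))
exporter.start_exporting()
exporter.export_item({'name': 'Color TV', 'price': '1200'})
exporter.finish_exporting()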
XmlItemExporter
class scrapy.exporters.XmlItemExporter(file, item_element='item', root_element='items', **kwargs)
Exports Items in XML format to the specified file object.
Parameters
file the file-like object to use for exporting the data.
root_element (str) The name of root element in the exported XML.
item_element (str) The name of each item element in the exported XML.
The additional keyword arguments of this constructor are passed to the BaseItemExporter constructor.
A typical output of this exporter would be:
<?xml version="1.0" encoding="utf-8"?>
<items>
<item>
<name>Color TV</name>
<price>1200</price>
</item>
<item>
<name>DVD player</name>
<price>200</price>
</item>
</items>
Unless overridden in the serialize_field() method, multi-valued fields are exported by serializing each
value inside a <value> element. This is for convenience, as multi-valued fields are very common.
For example, the item:
Item(name=['John', 'Doe'], age='23')
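Under the default behaviour just described, such an item would be serialized as something like the following (a sketch of the expected output):
<?xml version="1.0" encoding="utf-8"?>
<items>
  <item>
    <name><value>John</value><value>Doe</value></name>
    <age>23</age>
  </item>
</items>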
CsvItemExporter
class scrapy.exporters.CsvItemExporter(file, include_headers_line=True, join_multivalued=',', **kwargs)
Exports Items in CSV format to the given file-like object. If the fields_to_export attribute is set, it will
be used to define the CSV columns and their order. The export_empty_fields attribute has no effect on
this exporter.
Parameters
file the file-like object to use for exporting the data.
PickleItemExporter
class scrapy.exporters.PickleItemExporter(file, protocol=0, **kwargs)
Exports Items in pickle format to the given file-like object.
Parameters
file the file-like object to use for exporting the data.
protocol (int) The pickle protocol to use.
For more information, refer to the pickle module documentation.
The additional keyword arguments of this constructor are passed to the BaseItemExporter constructor.
Pickle isn't a human-readable format, so no output examples are provided.
PprintItemExporter
class scrapy.exporters.PprintItemExporter(file, **kwargs)
Exports Items in pretty print format to the specified file object.
Parameters file the file-like object to use for exporting the data.
The additional keyword arguments of this constructor are passed to the BaseItemExporter constructor.
A typical output of this exporter would be:
{'name': 'Color TV', 'price': '1200'}
{'name': 'DVD player', 'price': '200'}
Warning: JSON is a very simple and flexible serialization format, but it doesn't scale well for large amounts of data since incremental (aka. stream-mode) parsing is not well supported (if at all) among JSON parsers (in any language), and most of them just parse the entire object in memory. If you want the power and simplicity of JSON with a more stream-friendly format, consider using JsonLinesItemExporter instead, or splitting the output into multiple chunks.
JsonLinesItemExporter
class scrapy.exporters.JsonLinesItemExporter(file, **kwargs)
Exports Items in JSON format to the specified file-like object, writing one JSON-encoded item per line. The
additional constructor arguments are passed to the BaseItemExporter constructor, and the leftover arguments to the JSONEncoder constructor, so you can use any JSONEncoder constructor argument to customize
this exporter.
Parameters file the file-like object to use for exporting the data.
A typical output of this exporter would be:
{"name": "Color TV", "price": "1200"}
{"name": "DVD player", "price": "200"}
Unlike the one produced by JsonItemExporter, the format produced by this exporter is well suited for
serializing large amounts of data.
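A short usage sketch that would produce output like the lines above (the file name is illustrative):
from scrapy.exporters import JsonLinesItemExporter

with open('items.jl', 'wb') as f:
    exporter = JsonLinesItemExporter(f)
    exporter.start_exporting()
    exporter.export_item({'name': 'Color TV', 'price': '1200'})
    exporter.export_item({'name': 'DVD player', 'price': '200'})
    exporter.finish_exporting()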
Architecture overview Understand the Scrapy architecture.
Downloader Middleware Customize how pages get requested and downloaded.
Spider Middleware Customize the input and output of your spiders.
Extensions Extend Scrapy with your custom functionality.
Core API Use it on extensions and middlewares to extend Scrapy functionality.
Signals See all available signals and how to work with them.
Item Exporters Quickly export your scraped items to a file (XML, CSV, etc).
CHAPTER 7
7.1 Release notes
New version
class MySpider(scrapy.Spider):
def parse(self, response):
return {'url': response.url}
New version
import logging
logging.info('MESSAGE')
Logging with spiders remains the same, but on top of the log() method you'll have access to a custom logger created for the spider to issue log events:
class MySpider(scrapy.Spider):
def parse(self, response):
self.logger.info('Response received')
Bear in mind this feature is still under development and its API may change until it reaches a stable status.
See more examples for scripts running Scrapy: Common Practices
Module Relocations
There's been a large rearrangement of modules trying to improve the general structure of Scrapy. The main changes were separating various subpackages into new projects and dissolving both scrapy.contrib and scrapy.contrib_exp into top-level packages. Backward compatibility was kept among internal relocations; when importing deprecated modules, expect warnings indicating their new place.
Full list of relocations
Outsourced packages
Note: These extensions went through some minor changes, e.g. some setting names were changed. Please check the
documentation in each new repository to get familiar with the new usage.
Old location                 New location
scrapy.commands.deploy       scrapyd-client (See other alternatives here: Deploying Spiders)
scrapy.contrib.djangoitem    scrapy-djangoitem
scrapy.webservice            scrapy-jsonrpc
185
Old location                                             New location
scrapy.contrib_exp.downloadermiddleware.decompression    scrapy.downloadermiddlewares.decompression
scrapy.contrib_exp.iterators                             scrapy.utils.iterators
scrapy.contrib.downloadermiddleware                      scrapy.downloadermiddlewares
scrapy.contrib.exporter                                  scrapy.exporters
scrapy.contrib.linkextractors                            scrapy.linkextractors
scrapy.contrib.loader                                    scrapy.loader
scrapy.contrib.loader.processor                          scrapy.loader.processors
scrapy.contrib.pipeline                                  scrapy.pipelines
scrapy.contrib.spidermiddleware                          scrapy.spidermiddlewares
scrapy.contrib.spiders                                   scrapy.spiders
scrapy.contrib.closespider                               scrapy.extensions.*
scrapy.contrib.corestats
scrapy.contrib.debug
scrapy.contrib.feedexport
scrapy.contrib.httpcache
scrapy.contrib.logstats
scrapy.contrib.memdebug
scrapy.contrib.memusage
scrapy.contrib.spiderstate
scrapy.contrib.statsmailer
scrapy.contrib.throttle
Plural renames and Modules unification
Old location              New location
scrapy.command            scrapy.commands
scrapy.dupefilter         scrapy.dupefilters
scrapy.linkextractor      scrapy.linkextractors
scrapy.spider             scrapy.spiders
scrapy.squeue             scrapy.squeues
scrapy.statscol           scrapy.statscollectors
scrapy.utils.decorator    scrapy.utils.decorators
Class renames
Old location                           New location
scrapy.spidermanager.SpiderManager     scrapy.spiderloader.SpiderLoader
Settings renames
Old location              New location
SPIDER_MANAGER_CLASS      SPIDER_LOADER_CLASS
Changelog
New Features and Enhancements
Python logging (issue 1060, issue 1235, issue 1236, issue 1240, issue 1259, issue 1278, issue 1286)
FEED_EXPORT_FIELDS option (issue 1159, issue 1224)
Dns cache size and timeout options (issue 1132)
support namespace prefix in xmliter_lxml (issue 963)
Reactor threadpool max size setting (issue 1123)
Add a setting to control what class is instantiated as Downloader component (issue 738)
Pass response in item_dropped signal (issue 724)
Improve scrapy check contracts command (issue 733, issue 752)
Document spider.closed() shortcut (issue 719)
Document request_scheduled signal (issue 746)
Add a note about reporting security issues (issue 697)
Add LevelDB http cache storage backend (issue 626, issue 500)
Sort spider list output of scrapy list command (issue 742)
Multiple documentation enhancements and fixes (issue 575, issue 587, issue 590, issue 596, issue 610, issue 617, issue 618, issue 627, issue 613, issue 643, issue 654, issue 675, issue 663, issue 711, issue 714)
Bugfixes
Encode unicode URL value when creating Links in RegexLinkExtractor (issue 561)
Ignore None values in ItemLoader processors (issue 556)
Fix link text when there is an inner tag in SGMLLinkExtractor and HtmlParserLinkExtractor (issue 485, issue
574)
Fix wrong checks on subclassing of deprecated classes (issue 581, issue 584)
Handle errors caused by inspect.stack() failures (issue 582)
Fix a reference to a non-existent engine attribute (issue 593, issue 594)
Fix dynamic itemclass example usage of type() (issue 603)
Use lucasdemarchi/codespell to fix typos (issue 628)
Fix default value of attrs argument in SgmlLinkExtractor to be tuple (issue 661)
Fix XXE flaw in sitemap reader (issue 676)
Fix engine to support filtered start requests (issue 707)
Fix offsite middleware case on urls with no hostnames (issue 745)
Testsuite doesn't require PIL anymore (issue 585)
Fix wrong checks on subclassing of deprecated classes. closes #581 (commit 46d98d6)
Docs: 4-space indent for final spider example (commit 13846de)
Fix HtmlParserLinkExtractor and tests after #485 merge (commit 368a946)
BaseSgmlLinkExtractor: Fixed the missing space when the link has an inner tag (commit b566388)
BaseSgmlLinkExtractor: Added unit test of a link with an inner tag (commit c1cb418)
BaseSgmlLinkExtractor: Fixed unknown_endtag() so that it only set current_link=None when the end tag match
the opening tag (commit 7e4d627)
Fix tests for Travis-CI build (commit 76c7e20)
replace unencodeable codepoints with html entities. fixes #562 and #285 (commit 5f87b17)
RegexLinkExtractor: encode URL unicode value when creating Links (commit d0ee545)
Updated the tutorial crawl output with latest output. (commit 8da65de)
Updated shell docs with the crawler reference and fixed the actual shell output. (commit 875b9ab)
PEP8 minor edits. (commit f89efaf)
Expose current crawler in the scrapy shell. (commit 5349cec)
Unused re import and PEP8 minor edits. (commit 387f414)
Ignore Nones values when using the ItemLoader. (commit 0632546)
DOC Fixed HTTPCACHE_STORAGE typo in the default value which is now Filesystem instead Dbm. (commit
cde9a8c)
show ubuntu setup instructions as literal code (commit fb5c9c5)
Update Ubuntu installation instructions (commit 70fb105)
Merge pull request #550 from stray-leone/patch-1 (commit 6f70b6a)
modify the version of scrapy ubuntu package (commit 725900d)
fix 0.22.0 release date (commit af0219a)
fix typos in news.rst and remove (not released yet) header (commit b7f58f4)
Promote startup info on settings and middleware to INFO level (issue 520)
Support partials in get_func_args util (issue 506, issue 504)
Allow running individual tests via tox (issue 503)
Update extensions ignored by link extractors (issue 498)
Add middleware methods to get files/images/thumbs paths (issue 490)
Improve offsite middleware tests (issue 478)
Add a way to skip default Referer header set by RefererMiddleware (issue 475)
Do not send x-gzip in default Accept-Encoding header (issue 469)
Support defining http error handling using settings (issue 466)
Use modern python idioms wherever you find legacies (issue 497)
Improve and correct documentation (issue 527, issue 524, issue 521, issue 517, issue 512, issue 505, issue 502,
issue 489, issue 465, issue 460, issue 425, issue 536)
Fixes
Update Selector class imports in CrawlSpider template (issue 484)
Fix unexistent reference to engine.slots (issue 464)
Do not try to call body_as_unicode() on a non-TextResponse instance (issue 462)
Warn when subclassing XPathItemLoader, previously it only warned on instantiation. (issue 523)
Warn when subclassing XPathSelector, previously it only warned on instantiation. (issue 537)
Multiple fixes to memory stats (issue 531, issue 530, issue 529)
Fix overriding url in FormRequest.from_response() (issue 507)
Fix tests runner under pip 1.5 (issue 513)
Fix logging error when spider name is unicode (issue 479)
Request/Response url/body attributes are now immutable (modifying them had been deprecated for a long time)
ITEM_PIPELINES is now defined as a dict (instead of a list)
Sitemap spider can fetch alternate URLs (issue 360)
Selector.remove_namespaces() now removes namespaces from element attributes. (issue 416)
Paved the road for Python 3.3+ (issue 435, issue 436, issue 431, issue 452)
New item exporter using native python types with nesting support (issue 366)
Tune HTTP1.1 pool size so it matches concurrency defined by settings (commit b43b5f575)
scrapy.mail.MailSender now can connect over TLS or upgrade using STARTTLS (issue 327)
New FilesPipeline with functionality factored out from ImagesPipeline (issue 370, issue 409)
Recommend Pillow instead of PIL for image handling (issue 317)
Added debian packages for Ubuntu quantal and raring (commit 86230c0)
Mock server (used for tests) can listen for HTTPS requests (issue 410)
Remove multi spider support from multiple core components (issue 422, issue 421, issue 420, issue 419, issue
423, issue 418)
Travis-CI now tests Scrapy changes against development versions of w3lib and queuelib python packages.
Add pypy 2.1 to continuous integration tests (commit ecfa7431)
Pylinted, pep8 and removed old-style exceptions from source (issue 430, issue 432)
Use importlib for parametric imports (issue 445)
Handle a regression introduced in Python 2.7.5 that affects XmlItemExporter (issue 372)
Bugfix crawling shutdown on SIGINT (issue 450)
Do not submit reset type inputs in FormRequest.from_response (commit b326b87)
Do not silence download errors when request errback raises an exception (commit 684cfc0)
Bugfixes
Fix tests under Django 1.6 (commit b6bed44c)
Lots of bugfixes to the retry middleware under disconnections using the HTTP 1.1 download handler
Fix inconsistencies among Twisted releases (issue 406)
Fix scrapy shell bugs (issue 418, issue 407)
Fix invalid variable name in setup.py (issue 429)
Fix tutorial references (issue 387)
Improve request-response docs (issue 391)
Improve best practices docs (issue 399, issue 400, issue 401, issue 402)
Improve django integration docs (issue 404)
Document bindaddress request meta (commit 37c24e01d7)
Improve Request class documentation (issue 226)
Other
Dropped Python 2.6 support (issue 448)
Add cssselect python package as install dependency
Drop libxml2 and multi selectors backend support, lxml is required from now on.
Minimum Twisted version increased to 10.0.0, dropped Twisted 8.0 support.
Running test suite now requires mock python library (issue 390)
Thanks
Thanks to everyone who contributed to this release!
Fixed error message formatting. log.err() doesn't support cool formatting and when error occurred, the message was: ERROR: Error processing %(item)s (commit c16150c)
lint and improve images pipeline error logging (commit 56b45fc)
fixed doc typos (commit 243be84)
add documentation topics: Broad Crawls & Common Practices (commit 1fbb715)
fix bug in scrapy parse command when spider is not specified explicitly. closes #209 (commit c72e682)
Update docs/topics/commands.rst (commit 28eac7a)
downloader handlers (DOWNLOAD_HANDLERS setting) now receive settings as the first argument of the constructor
replaced memory usage accounting with the (more portable) resource module, removed scrapy.utils.memory module
removed signal: scrapy.mail.mail_sent
removed TRACK_REFS setting, now trackrefs is always enabled
DBM is now the default storage backend for HTTP cache middleware
number of log messages (per level) is now tracked through Scrapy stats (stat name: log_count/LEVEL)
number of received responses is now tracked through Scrapy stats (stat name: response_received_count)
removed scrapy.log.started attribute
7.1.29 0.14.4
added precise to supported ubuntu distros (commit b7e46df)
fixed bug in json-rpc webservice reported in https://groups.google.com/forum/#!topic/scrapy-users/qgVBmFybNAQ/discussion. also removed no longer supported run command from extras/scrapy-ws.py (commit 340fbdb)
meta tag attributes for content-type http equiv can be in any order. #123 (commit 0cb68af)
replace import Image by more standard from PIL import Image. closes #88 (commit 4d17048)
return trial status as bin/runtests.sh exit value. #118 (commit b7b2e7f)
7.1.30 0.14.3
forgot to include pydispatch license. #118 (commit fd85f9c)
include egg files used by testsuite in source distribution. #118 (commit c897793)
update docstring in project template to avoid confusion with genspider command, which may be considered as
an advanced feature. refs #107 (commit 2548dcc)
added note to docs/topics/firebug.rst about google directory being shut down (commit 668e352)
don't discard slot when empty, just save in another dict in order to recycle if needed again. (commit 8e9f607)
do not fail handling unicode xpaths in libxml2 backed selectors (commit b830e95)
fixed minor mistake in Request objects documentation (commit bf3c9ee)
fixed minor defect in link extractors documentation (commit ba14f38)
removed some obsolete remaining code related to sqlite support in scrapy (commit 0665175)
7.1.31 0.14.2
move buffer pointing to start of file before computing checksum. refs #92 (commit 6a5bef2)
Compute image checksum before persisting images. closes #92 (commit 9817df1)
remove leaking references in cached failures (commit 673a120)
fixed bug in MemoryUsage extension: get_engine_status() takes exactly 1 argument (0 given) (commit 11133e9)
7.1.32 0.14.1
extras/makedeb.py: no longer obtaining version from git (commit caffe0e)
bumped version to 0.14.1 (commit 6cb9e1c)
fixed reference to tutorial directory (commit 4b86bd6)
doc: removed duplicated callback argument from Request.replace() (commit 1aeccdd)
fixed formatting of scrapyd doc (commit 8bf19e6)
Dump stacks for all running threads and fix engine status dumped by StackTraceDump extension (commit
14a8e6e)
added comment about why we disable ssl on boto images upload (commit 5223575)
SSL handshaking hangs when doing too many parallel connections to S3 (commit 63d583d)
change tutorial to follow changes on dmoz site (commit bcb3198)
Avoid _disconnectedDeferred AttributeError exception in Twisted>=11.1.0 (commit 98f3f87)
allow spider to set autothrottle max concurrency (commit 175a4b5)
7.1.33 0.14
New features and settings
Support for AJAX crawlable URLs
New persistent scheduler that stores requests on disk, allowing crawls to be suspended and resumed (r2737)
added -o option to scrapy crawl, a shortcut for dumping scraped items into a file (or standard output using
-)
Added support for passing custom settings to Scrapyd schedule.json api (r2779, r2783)
New ChunkedTransferMiddleware (enabled by default) to support chunked transfer encoding (r2769)
Add boto 2.0 support for S3 downloader handler (r2763)
Added marshal to formats supported by feed exports (r2744)
In request errbacks, offending requests are now received in failure.request attribute (r2738)
Big downloader refactoring to support per domain/ip concurrency limits (r2732)
CONCURRENT_REQUESTS_PER_SPIDER setting has been deprecated and replaced by:
CONCURRENT_REQUESTS
CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS_PER_IP
7.1.34 0.12
The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.
New features and improvements
Passed item is now sent in the item argument of the item_passed (#273)
Added verbose option to scrapy version command, useful for bug reports (#298)
HTTP cache now stored by default in the project data dir (#279)
Added project data storage directory (#276, #277)
Documented file structure of Scrapy projects (see command-line tool doc)
New lxml backend for XPath selectors (#147)
7.1.35 0.10
The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.
New features and improvements
New Scrapy service called scrapyd for deploying Scrapy crawlers in production (#218) (documentation available)
Simplified Images pipeline usage which doesn't require subclassing your own images pipeline now (#217)
Scrapy shell now shows the Scrapy log by default (#206)
Refactored execution queue in a common base code and pluggable backends called spider queues (#220)
New persistent spider queue (based on SQLite) (#198), available by default, which allows starting Scrapy in server mode and then scheduling spiders to run.
Added documentation for Scrapy command-line tool and all its available sub-commands. (documentation available)
default per-command settings are now specified in the default_settings attribute of command object
class (#201)
changed arguments of Item pipeline process_item() method from (spider, item) to (item, spider)
backwards compatibility kept (with deprecation warning)
moved scrapy.core.signals module to scrapy.signals
backwards compatibility kept (with deprecation warning)
moved scrapy.core.exceptions module to scrapy.exceptions
backwards compatibility kept (with deprecation warning)
added handles_request() class method to BaseSpider
dropped scrapy.log.exc() function (use scrapy.log.err() instead)
dropped component argument of scrapy.log.msg() function
dropped scrapy.log.log_level attribute
Added from_settings() class methods to Spider Manager, and Item Pipeline Manager
Changes to settings
Added HTTPCACHE_IGNORE_SCHEMES setting to ignore certain schemes on HttpCacheMiddleware (#225)
Added SPIDER_QUEUE_CLASS setting which defines the spider queue to use (#220)
Added KEEP_ALIVE setting (#220)
Removed SERVICE_QUEUE setting (#220)
Removed COMMANDS_SETTINGS_MODULE setting (#201)
Renamed REQUEST_HANDLERS to DOWNLOAD_HANDLERS and made download handlers classes (instead of functions)
7.1.36 0.9
The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.
New features and improvements
Added SMTP-AUTH support to scrapy.mail
New settings added: MAIL_USER, MAIL_PASS (r2065 | #149)
Added new scrapy-ctl view command - To view URL in the browser, as seen by Scrapy (r2039)
Added web service for controlling Scrapy process (this also deprecates the web console). (r2053 | #167)
Support for running Scrapy as a service, for production systems (r1988, r2054, r2055, r2056, r2057 | #168)
Added wrapper induction library (documentation only available in source code for now). (r2011)
Simplified and improved response encoding support (r1961, r1969)
Added LOG_ENCODING setting (r1956, documentation available)
Added RANDOMIZE_DOWNLOAD_DELAY setting (enabled by default) (r1923, doc available)
7.1.37 0.8
The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.
New features
Added DEFAULT_RESPONSE_ENCODING setting (r1809)
Added dont_click argument to FormRequest.from_response() method (r1813, r1816)
Added clickdata argument to FormRequest.from_response() method (r1802, r1803)
Added support for HTTP proxies (HttpProxyMiddleware) (r1781, r1785)
Offsite spider middleware now logs messages when filtering out requests (r1841)
Backwards-incompatible changes
Changed scrapy.utils.response.get_meta_refresh() signature (r1804)
Removed deprecated scrapy.item.ScrapedItem class - use scrapy.item.Item instead (r1838)
Removed deprecated scrapy.xpath module - use scrapy.selector instead. (r1836)
Removed deprecated core.signals.domain_open signal - use core.signals.domain_opened
instead (r1822)
7.1.38 0.7
First release of Scrapy.
7.2 Contributing to Scrapy
There are many ways to contribute to Scrapy. Here are some of them:
Blog about Scrapy. Tell the world how you're using Scrapy. This will help newcomers with more examples and the Scrapy project to increase its visibility.
Report bugs and request features in the issue tracker, trying to follow the guidelines detailed in Reporting bugs
below.
Submit patches for new functionality and/or bug fixes. Please read Writing patches and Submitting patches
below for details on how to write and submit a patch.
Join the scrapy-users mailing list and share your ideas on how to improve Scrapy. We're always open to suggestions.
7.2.7 Tests
Tests are implemented using the Twisted unit-testing framework; running the tests requires tox.
Running tests
To run all tests go to the root directory of Scrapy source code and run:
tox
To run a specific test (say tests/test_loader.py) use:
tox -- tests/test_loader.py
Writing tests
All functionality (including new features and bug fixes) must include a test case to check that it works as expected, so
please include tests for your patches if you want them to get accepted sooner.
Scrapy uses unit-tests, which are located in the tests/ directory. Their module name typically resembles the full path of the module they're testing. For example, the item loader code is in:
scrapy.loader
Its unit-tests are in:
tests/test_loader.py
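As an illustration only (the test class, markup and assertion are invented), a new test module under tests/ typically looks like a standard unittest case that tox/trial can collect:
import unittest

from scrapy.http import HtmlResponse

class CssShortcutTest(unittest.TestCase):

    def test_css_text_shortcut(self):
        response = HtmlResponse(url='http://example.com',
                                body='<html><body><p>hello</p></body></html>',
                                encoding='utf-8')
        self.assertEqual(response.css('p::text').extract(), [u'hello'])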
Python Module Index
scrapy.contracts
scrapy.contracts.default
scrapy.crawler
scrapy.downloadermiddlewares
scrapy.downloadermiddlewares.ajaxcrawl
scrapy.downloadermiddlewares.chunked
scrapy.downloadermiddlewares.cookies
scrapy.downloadermiddlewares.defaultheaders
scrapy.downloadermiddlewares.downloadtimeout
scrapy.downloadermiddlewares.httpauth
scrapy.downloadermiddlewares.httpcache
scrapy.downloadermiddlewares.httpcompression
scrapy.downloadermiddlewares.httpproxy
scrapy.downloadermiddlewares.redirect
scrapy.downloadermiddlewares.retry
scrapy.downloadermiddlewares.robotstxt
scrapy.downloadermiddlewares.stats
scrapy.downloadermiddlewares.useragent
scrapy.exceptions
scrapy.exporters
scrapy.extensions.closespider
scrapy.extensions.corestats
scrapy.extensions.debug
scrapy.extensions.logstats
scrapy.extensions.memdebug
scrapy.extensions.memusage
scrapy.extensions.statsmailer
scrapy.http
scrapy.item
scrapy.linkextractors
scrapy.linkextractors.lxmlhtml
scrapy.loader
scrapy.loader.processors
scrapy.mail
scrapy.pipelines.files
scrapy.pipelines.images
scrapy.selector
scrapy.settings
scrapy.signalmanager
scrapy.signals
scrapy.spidermiddlewares
scrapy.spidermiddlewares.depth
scrapy.spidermiddlewares.httperror
scrapy.spidermiddlewares.offsite
scrapy.spidermiddlewares.referer
scrapy.spidermiddlewares.urllength
scrapy.spiders
scrapy.statscollectors
scrapy.telnet
scrapy.utils.log
scrapy.utils.trackref