Learning Scrapy - Sample Chapter
Learning Scrapy
Learn the art of efficient web scraping and crawling with Python
Dimitrios Kouzis-Loukas
About the Author
Dimitrios Kouzis-Loukas is a software developer. He uses his acquired knowledge and expertise to teach a wide range of audiences how to write great software, as well.
He studied and mastered several disciplines, including mathematics, physics, and
microelectronics. His thorough understanding of these subjects helped him raise his
standards beyond the scope of "pragmatic solutions." He knows that true solutions
should be as certain as the laws of physics, as robust as ECC memories, and as
universal as mathematics.
Preface
Let me take a wild guess. One of these two stories is curiously similar to yours:
Your first encounter with Scrapy was while searching the net for something along
the lines of "web scraping Python". You had a quick look at it and thought, "This is
too complex...I just need something simple." You went on and developed a Python
script using requests, struggled a bit with Beautiful Soup, but finally made something
cool. It was kind of slow, so you let it run overnight. You restarted it a few times,
ignored some semi-broken links and non-English characters, and in the morning,
most of the website was proudly on your hard disk. Sadly, for some unknown
reason, you didn't want to see your code again. The next time you had to scrape
something, you went directly to scrapy.org and this time the documentation made
perfect sense. Scrapy now felt like it was elegantly and effortlessly solving all of the
problems that you faced, and it even took care of problems you hadn't thought of
yet. You never looked back.
Alternatively, your first encounter with Scrapy was while doing research for a web-scraping project. You needed something robust, fast, and enterprise-grade, so most of the fancy one-click web-scraping tools were out of the question. You needed it to be simple but at the same time flexible enough to allow you to customize its behavior for different sources, provide different types of output feeds, and reliably run 24/7 in an automated manner. Companies that provided scraping as a service seemed too expensive, and you were more comfortable using open source solutions than feeling locked in to a vendor. From the very beginning, Scrapy looked like a clear winner.
No matter how you got here, I'm glad to meet you in a book that is entirely devoted
to Scrapy. Scrapy is the secret of web-scraping experts throughout the world. They
know how to maneuver it to save them hours of work, deliver stellar performance,
and keep their hosting bills to an absolute minimum. If you are less experienced and
you want to achieve their results, unfortunately, Google will do you a disservice.
The majority of Scrapy information on the Web is either simplistic and inefficient
or complex. This book is an absolute necessity for everyone who wants accurate,
accessible, and well-organized information on how to make the most out of Scrapy. It
is my hope that it will help the Scrapy community grow even further and give it the
wide adoption that it rightfully deserves.
Chapter 8, Programming Scrapy, takes our knowledge to a whole new level by showing
us how to use the underlying Twisted engine and Scrapy's architecture to extend
every aspect of its functionality.
Chapter 9, Pipeline Recipes, presents numerous examples where we alter Scrapy's functionality to insert items into databases such as MySQL, Elasticsearch, and Redis, and to interface with APIs and legacy applications, with virtually no degradation of performance.
Chapter 10, Understanding Scrapy's Performance, will help us understand how Scrapy
spends its time, and what exactly we need to do to increase its performance.
Chapter 11, Distributed Crawling with Scrapyd and Real-Time Analytics, is our final chapter, showing how to use scrapyd on multiple servers to achieve horizontal scalability, and how to feed crawled data to an Apache Spark server that performs stream analytics on it.
Introducing Scrapy
Welcome to your Scrapy journey. With this book, we aim to take you from a Scrapy beginner, someone who has little or no experience with Scrapy, to a level where you will be able to confidently use this powerful framework to scrape large datasets from the web or other sources. In this chapter, we will introduce you to Scrapy and talk to you about some of the great things you can achieve with it.
Hello Scrapy
Scrapy is a robust web framework for scraping data from various sources. As a casual web user, you will often find yourself wishing to get data from a website you're browsing into a spreadsheet program like Excel (see Chapter 3, Basic Crawling) in order to access it while you're offline or to perform calculations. As a developer, you'll often wish to combine data from various sources, but you are well aware of the complexities of retrieving or extracting it. Scrapy can help you complete both easy and complex data extraction initiatives.
Scrapy is built upon years of experience in extracting massive amounts of data in a robust and efficient manner. With Scrapy, you can do with a single setting what would take various classes, plug-ins, and configuration in most other scraping frameworks. A quick look at Chapter 7, Configuration and Management, will make you appreciate how much you can achieve in Scrapy with a few lines of configuration.
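To give you an early taste of what this looks like in practice, here is a minimal spider sketch. It is illustrative only: the target site (quotes.toscrape.com, a public sandbox built for scraping practice) and its selectors are assumptions made for this example, not a project from this book.

import scrapy

class QuoteSpider(scrapy.Spider):
    name = "quotes"
    # A public sandbox site intended for scraping practice
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block, using CSS selectors
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
            }

You can run a standalone spider like this with scrapy runspider quotes_spider.py -o items.json; Scrapy schedules the requests, handles retries and throttling, and writes a JSON feed without any further code.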
Community
Scrapy has a vibrant community. Just have a look at the scrapy-users mailing list and the thousands of Scrapy-tagged questions on Stack Overflow.
Be prepared to read this book several times. Maybe you can start by skimming
through it to understand its structure. Then read a chapter or two, learn, experiment
for a while, and then move further. Don't be afraid to skip a chapter if you feel
familiar with it. In particular, if you know HTML and XPath, there's no point
spending much time on Chapter 2, Understanding HTML and XPath. Don't worry;
this book still has plenty for you. Some chapters, such as Chapter 8, Programming Scrapy, combine the elements of a reference and a tutorial and go in depth into programming concepts. That's an example of a chapter one might like to read a few times, allowing a couple of weeks of Scrapy practice in between. You don't need to perfectly master Chapter 8, Programming Scrapy, before moving on, for example, to Chapter 9, Pipeline Recipes, which is full of applications. Reading the latter will help you understand how to use the programming concepts, and if you wish, you can revisit Chapter 8 as many times as you like.
We have tried to balance the pace to keep the book both interesting and beginner-friendly. One thing we can't do, though, is teach Python in this book. There are several excellent books on the subject, but what I would recommend is a slightly more relaxed attitude while learning. One of the reasons Python is so popular is that it's relatively simple and clean, and it reads well, almost like English. Scrapy is a high-level framework that Python beginners and experts alike have to learn. You could call it "the Scrapy language". As a result, I would recommend going through the material, and if you find the Python syntax confusing, supplement your learning with some of the excellent online Python tutorials or free online Python courses for beginners at Coursera or elsewhere. Rest assured, you can be quite a good Scrapy developer without being a Python expert.
Some aspects of this process that are easy to overlook are very closely connected with
the data problems that Scrapy solves for us. When we ask potential customers to try
our mobile app, for example, we as developers or entrepreneurs ask them to judge the functionality while imagining how the app will look when completed. This might be a bit too much imagining for a non-expert. The distance between an app that shows "product 1", "product 2", and "user 433", and an application that provides information on "Samsung UN55J6200 55-Inch TV", which has a five-star rating from user "Richard S." and working links that take you directly to a product detail page (despite the fact that we didn't write that page), is significant. It's very difficult for people to
judge the functionality of an MVP objectively, unless the data that we use is realistic
and somewhat exciting.
One of the reasons that some start-ups treat data as an afterthought is the perception that collecting it is expensive. Indeed, we would typically need to develop forms and administration screens, and spend time entering data. Alternatively, we could just use Scrapy and crawl a few websites before writing even a single line of code. You will see in Chapter 4, From Scrapy to a Mobile App, how easy it is to develop a simple mobile app as soon as you have data.
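As a rough sketch of how little is needed, Scrapy's feed exports can send every scraped item straight to a file that your prototype imports. FEED_FORMAT and FEED_URI are real Scrapy settings, though the values here are just examples:

# settings.py
FEED_FORMAT = "csv"                  # json, jsonlines, and xml also work
FEED_URI = "export/%(name)s.csv"     # %(name)s expands to the spider's name

Populating an MVP's screens then becomes a matter of importing that file, rather than inventing and typing records by hand.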
The idea of Google using forms might sound a bit ridiculous, but how many forms does a typical website require a user to fill in? A login form, a new listing form, a checkout form, and so on. How much do those forms really cost by hindering the application's growth? If you know your audience/customers well enough, it is highly likely that you have a clue about the other websites they typically use, and might already have an account with. For example, a developer will likely have a Stack Overflow and a GitHub account. Could you, with their permission, scrape those sites as soon as they give you their username, and auto-fill their photos, their bio, and a few recent posts? Can you perform some quick text analytics on the posts they are most interested in, and use it to adapt your site's navigation structure and suggested products or services? I hope you can see how replacing forms with automated data scraping can allow you to better serve your audience, and grow at web scale.
What Scrapy is not
Scrapy is not Apache Nutch, that is, it's not a generic web crawler. If Scrapy visits a website it knows nothing about, it won't be able to make anything meaningful out of it. Scrapy is about extracting structured information, and requires manual effort to set up the appropriate XPath or CSS expressions. Apache Nutch, by contrast, takes a generic page and extracts information, such as keywords, from it. Each might be more suitable for some applications and less so for others.
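For a sense of what that manual effort looks like, extracting a product's title might take a single expression like the one below; the markup it targets (an h1 tag carrying a schema.org itemprop attribute) is a hypothetical example, not a universal pattern:

response.xpath('//h1[@itemprop="name"]/text()').extract_first()

The equivalent CSS form would be response.css('h1[itemprop="name"]::text').extract_first(). Writing and maintaining a handful of such expressions per site is the price Scrapy pays for clean, structured output.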
Scrapy is not Apache Solr, Elasticsearch, or Lucene; in other words, it has nothing to do with a search engine. Scrapy is not intended to give you references to the documents that contain the word "Einstein" or anything else. You can take the data extracted by Scrapy and insert it into Solr or Elasticsearch, as we do at the beginning of Chapter 9, Pipeline Recipes, but that's just one way of using Scrapy, not something embedded into it.
Finally, Scrapy is not a database like MySQL, MongoDB, or Redis. It neither stores nor indexes data; it only extracts it. That said, you will likely insert the data that Scrapy extracts into a database, and there is support for many of them, which will make your life easier. Scrapy isn't a database, though, and its output could just as easily be files on a disk, or even no output at all, although I'm not sure how that would be useful.
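To make that last point concrete, here is a minimal item pipeline sketch; the class name and output file are hypothetical, but open_spider(), process_item(), and close_spider() are the standard hooks Scrapy calls on a pipeline:

import json

class JsonWriterPipeline(object):
    # A stand-in for a real database insert: write each item as one JSON line

    def open_spider(self, spider):
        self.file = open("items.jl", "w")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item  # hand the item on to any later pipelines

    def close_spider(self, spider):
        self.file.close()

You would enable it by listing the class in the ITEM_PIPELINES setting; Chapter 9, Pipeline Recipes, swaps the file for real MySQL, Elasticsearch, and Redis back ends.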
Summary
In this chapter, we introduced you to Scrapy, gave you an overview of what it can
help you with, and described what we believe is the best way to use this book. We
also presented several ways in which automated data scraping can benefit you
by helping you quickly develop high-quality applications that integrate nicely with
existing ecosystems. In the following chapter, we will introduce you to HTML and
XPath, two very important web languages that we will use in every Scrapy project.