Exemple-rapport-Stage-web-scraping-2022-ZD
Internship Topic:
Contributing to the development of a website
data extractor based on scraping and crawling
Prepared By
Wissem Edine Triki
Supervised By
Mohamed Ben Sassi and Mhamed Chammem
Hosting company: F2P2N (Forum for the Promotion of National Digital Product)
Summary
The goal of this internship, carried out with “F2P2N”, is to create a web crawler in order to collect data concerning Tunisian IT enterprises. In this report we present the process of building this crawler during the internship period.
In the first chapter we discuss the hosting company, the problem statement and the goal, along with the requirements specification.
In the second chapter we discuss web scraping and its different aspects followed by
the reasoning behind the environment choices.
In the third chapter we discuss the planning process and the Gantt diagram that shows the general tasks throughout the internship period.
In the fourth chapter we show the development process of the presented prototypes
and their results.
Abstract
Data extraction is the process of acquiring data from a specific source and moving it
into a new environment, which can be on-premises, cloud-based, or a hybrid of the two.
Various strategies are employed for this purpose, which can be complex and are
often performed manually. Unless data is being extracted for archival purposes only, this is
usually the first step in the extraction, transformation, and loading (ETL) process. This means
that data is almost always further processed after initial retrieval to make it available for later
analysis.
Table of contents
Acknowledgements
Summary
Abstract
Table of contents
List of Figures
General introduction
CHAPTER I: Preliminary Study
CHAPTER II: State of the art
CHAPTER III: Planning
CHAPTER IV: Development
Conclusion
Bibliography
Annexes
ANNEX 1: “Tunisie Industrie” target data
ANNEX 2: “RNE” target data
Appendix
Web crawling
Search Indexing
Big Data
Data Extraction
List of Figures
■ Figure 1: Python libraries.
■ Figure 2: Selenium logo.
■ Figure 3: Scrapy shell spider.
■ Figure 4: Scrapy shell response.
■ Figure 5: Scrapy shell table content.
■ Figure 6: Selenium table parse code sample.
■ Figure 7: Selenium data scraping result.
■ Figure 8: “Tunisie Industrie” target table.
■ Figure 9: “RNE” target table.
General introduction
In today's competitive world, everyone is looking for ways to innovate and take
advantage of new technologies. Web scraping (also known as web data extraction or data
scraping) provides a solution for those who wish to automate access to structured web data.
Web scraping is useful when the public website from which you want to retrieve data does
not have an API or has an API but provides limited access to the data.
Typically, web extracts are used by individuals and companies that want to make
better-informed decisions from the vast amounts of publicly available web data.
Web data extraction - also known as data scraping - has a wide range of applications.
Data scrapers can help you automate the process of extracting information from other
websites quickly and accurately. They also ensure that the data you extract is organized for analysis and use in other projects.
CHAPTER I: Preliminary Study
1.1 Introduction
Throughout this chapter, we will present the hosting company where this internship was carried out, explain its areas of application, and describe the problem that was presented along with the solutions proposed in order to deliver a suitable application.
The digital economy, being the economic and social lever par excellence, is becoming a potential that is difficult to ignore, both for the very high added value of its products and for its importance compared to other industries and sectors. The latter cannot develop without a technological and communication infrastructure, a digital transformation strategy, and digital products that help them sustain their offerings, stay ahead of the competition and improve their market position. This concerns production and service companies, distribution and marketing channels, as well as institutions, organizations and public services.
In this context, the association "Forum for the Promotion of National Digital Product" (F2P2N) was created as a scientific and cultural association focused on the promotion of national digital products.
The benefits of using a data extraction tool include more control, increased agility, simplified sharing, and greater accuracy and precision.
The main objectives are to deepen our understanding of web crawler technology, to create web crawling scripts (spiders) that search for and extract data related to enterprise services, products and skills, and to record the results for further processing.
CHAPTER II: State of the art
2.1 Introduction
Throughout this chapter we will explain web scraping, its process, and the variety of Python libraries available for building web scraping tools.
2.2 Web Scraping
2.2.1 Definition
Web scraping is the extraction of data from a website. The collected information is then exported into a format that is more useful to the user, such as a spreadsheet or an API. While web scraping can be done manually, automated tools are preferred in most cases because they can be cheaper and work faster. Even so, web scraping is not always an easy task: websites come in many forms, and the functions and features of web scrapers vary accordingly.
On the other hand, there are numerous pre-built web scrapers that you can download and run right away. Some of them also offer advanced options such as scrape scheduling, JSON and Google Sheets exports, and more. What we need, therefore, is an analysis of these libraries so we can make a choice adapted to our project’s needs.
Figure 1: Python libraries.
CHAPTER III: Planning
3.1 Introduction
Throughout this chapter we will present the different tools needed to build the crawler prototype and the process behind it.
3.2 Tools
3.2.1 VSCODE
Visual Studio Code (known as VS Code) is a free, open-source text editor from
Microsoft. VS Code is available for Windows, Linux, and macOS. Although
the editor is relatively lightweight, it includes some powerful features that make
VS Code one of the most popular development environment tools these days.
3.2.2 GitHub
GitHub is a web-based platform for hosting Git repositories, used for version control and for sharing and collaborating on source code.
3.2.3 Scrapy
Scrapy is an open-source web crawling and scraping framework written in Python. It provides many powerful features that make scraping easy and efficient (a minimal spider sketch is given after this list), such as:
● Built-in support for selecting and extracting data from HTML/XML sources using
extended CSS selectors and XPath expressions, with helper methods to extract using
regular expressions.
● An interactive shell console (IPython aware) for trying out the CSS and XPath
expressions to scrape data, very useful when writing or debugging your
spiders.
● Built-in support for generating feed exports in multiple formats (JSON, CSV,
XML) and storing them in multiple backends (FTP, S3, local filesystem).
● Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations.
● Strong extensibility support, allowing you to plug in your own functionality
using signals and a well-defined API (middlewares, extensions, and pipelines).
● Wide range of built-in extensions and middlewares for handling:
○ cookies and session handling
○ HTTP features like compression, authentication, caching
○ user-agent spoofing
○ robots.txt
○ crawl depth restriction
○ and more
● A Telnet console for hooking into a Python console running inside your
Scrapy process, to introspect and debug your crawler
● Plus other goodies like reusable spiders to crawl sites from Sitemaps and
XML/CSV feeds, a media pipeline for automatically downloading images (or any
other media) associated with the scraped items, a caching DNS resolver, and much
more!
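To make these features concrete, the following is a minimal, hypothetical spider sketch. The start URL, CSS selectors and field names are illustrative assumptions, not the project's actual code; such a spider could be run with "scrapy runspider companies_spider.py".

import scrapy


class CompanySpider(scrapy.Spider):
    """Minimal illustrative spider; URL, selectors and fields are assumptions."""
    name = "companies"
    start_urls = ["https://example.com/companies"]  # placeholder URL

    custom_settings = {
        # Built-in feed export: write the scraped items to a JSON file
        "FEEDS": {"companies.json": {"format": "json", "encoding": "utf8"}},
        "ROBOTSTXT_OBEY": True,   # respect robots.txt
        "DOWNLOAD_DELAY": 1.0,    # politeness delay between requests
    }

    def parse(self, response):
        # Select each table row with a CSS selector and yield a structured item
        for row in response.css("table tr"):
            yield {
                "name": row.css("td:nth-child(1)::text").get(),
                "activity": row.css("td:nth-child(2)::text").get(),
            }
        # Follow a pagination link, if the page has one
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)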
3.2.4 Selenium
Selenium is an open source umbrella project for a range of tools and libraries aimed at
supporting browser automation. It provides a playback tool for authoring functional tests
across most modern web browsers, without the need to learn a test scripting language.
● Open Source and Portable – Selenium is an open-source and portable web testing framework.
● Combination of tools and DSL – Selenium combines tools and a DSL (Domain Specific Language) in order to carry out various types of tests.
● Easier to understand and implement – Selenium commands are categorized into different classes, which makes them easier to understand and implement.
● Less burden and stress for testers – As mentioned above, the time required to repeat test scenarios on each and every new build is reduced almost to zero, so the burden on testers is reduced.
● Cost reduction for the business clients – The business needs to pay testers’ salaries, which can be saved by using automation testing tools. Automation not only saves time but also benefits the business.
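As an illustration of this kind of browser automation, here is a minimal, hypothetical Selenium sketch in Python. The URL and element locators are placeholders and do not come from the project code.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Start a browser session (assumes a locally available Chrome/chromedriver)
driver = webdriver.Chrome()
try:
    driver.get("https://example.com/search")  # placeholder URL
    # Wait until a results table rendered by JavaScript is present
    table = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "table#results"))
    )
    # Read the text of the first data row
    first_row = table.find_element(By.CSS_SELECTOR, "tr")
    print(first_row.text)
finally:
    driver.quit()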
Figure 2: Selenium logo.
3.3 Gantt Diagram
Here we present the Gantt diagram containing the main tasks and their time frame
during the internship period.
CHAPTER IV: Development
4.1 Introduction
Throughout this chapter we will present the crawler prototypes we built and the resulting scraped data.
In the Scrapy prototype we follow this procedure using a spider: fetch the target website, check the validity of the response, and then filter the target data shown in ANNEX 2.
In the Selenium prototype we access the target website, simulate the steps needed to reach the target table shown in ANNEX 1, access each row's information and extract the full data into a JSON file.
Scrapy performs web scraping through a spider containing a parser that extracts data from HTML responses.
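Since the figures are not reproduced in this text, the Scrapy shell session behind Figures 3 to 5 can be approximated by the following hedged sketch; the URL and selectors are illustrative assumptions rather than the actual commands used.

# Launched from a terminal with:  scrapy shell "https://example.com/directory"  (placeholder URL)
# Inside the interactive shell, the fetched response can be inspected directly:

response.status                      # 200 means the fetch succeeded
response.css("title::text").get()    # quick sanity check on the page

# Filter the target data, for example the rows of a results table
rows = response.css("table tr")
for row in rows[:5]:
    print(row.css("td::text").getall())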
Figure 4: Scrapy shell response.
In Figure 4 we observe the spider in action through the HTML responses obtained.
Due to the heavy use of JavaScript in the source websites, Scrapy had issues accurately scraping the target data and navigating the data sources, as seen in Figure 5. That is why we decided to drop Scrapy in favor of Selenium, which is better suited to navigating JavaScript-heavy websites and accessing all target data more efficiently.
Figure 6: Selenium table parse code sample.
The code sample in Figure 6 shows the process of scraping “Tunisie Industrie” and how we go through the different elements.
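Because Figure 6 is not reproduced here, the following is a hypothetical sketch of what such Selenium table parsing might look like; the URL, locators and column mapping are assumptions, not the internship's actual code.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/companies")  # placeholder for the target site

records = []
# Walk every data row of the target table and collect the cell texts
for row in driver.find_elements(By.CSS_SELECTOR, "table#target tbody tr"):
    cells = row.find_elements(By.TAG_NAME, "td")
    records.append({
        "name": cells[0].text,      # illustrative column mapping
        "activity": cells[1].text,
        "location": cells[2].text,
    })

driver.quit()
print(len(records), "rows collected")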
As JSON is a simple and fast way to store this data, Figure 7 shows a sample of the resulting data from the final Selenium prototype, stored in a JSON file that records the website's name, the timestamp and the scraped data.
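As a rough sketch of how such an output file might be produced in Python (the field names and file name are assumptions made for illustration, not the report's exact format):

import json
from datetime import datetime

records = [{"name": "Example Co", "activity": "IT services"}]  # placeholder scraped rows

output = {
    "website": "Tunisie Industrie",           # source site label
    "timestamp": datetime.now().isoformat(),  # when the scrape ran
    "data": records,
}

with open("tunisie_industrie.json", "w", encoding="utf-8") as f:
    json.dump(output, f, ensure_ascii=False, indent=2)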
Conclusion
With the rapid development of the Internet, businesses are becoming more and more dependent on data, and owning data on all aspects of the business is now a necessity.
The benefits of web scraping and its processes have become an important aspect of all
decision-making processes for companies of all sizes. It is clear that web scraping software
tools are moving forward and will provide users with a competitive advantage.
In this report we discussed what web scraping is and carried out the internship using Python scraping libraries, first Scrapy and ultimately Selenium, to create a web crawler that collects data related to Tunisian IT enterprises and records the results for further processing.
Bibliography
Annexes
ANNEX 1: “Tunisie Industrie” target data
ANNEX 2: “RNE” target data [1][2]
Appendix
Web crawling
Web search engines and some other websites use web crawler or spider software to
update their web content or the web content index of other websites. Web crawlers duplicate
pages for processing by search engines, which index downloaded pages so users can search
more efficiently.
Crawlers consume resources on visited systems and often visit websites without being invited. Scheduling, load, and "politeness" issues come into play when visiting large numbers of pages. For public sites that do not want to be crawled, there is a mechanism to let the crawling agent know: for example, including a robots.txt file can request that robots index only parts of a site, or none at all.
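As an illustration of how a crawler can honor this mechanism, here is a small hedged sketch using Python's standard urllib.robotparser; the site URL and user-agent string are placeholders.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()

user_agent = "example-crawler"                # illustrative user-agent string
url = "https://example.com/companies/page1"

if rp.can_fetch(user_agent, url):
    print("Allowed to crawl:", url)
else:
    print("robots.txt disallows:", url)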
Search Indexing
A search index is like creating a library card catalog for the web, so search engines
know where on the web to pull information when people search for it. It can also be likened
to an index at the end of a book, which lists all the places in the book where a particular topic
or phrase is mentioned.
The index focuses on the text displayed on the page and on the page metadata that is not visible to the user. When most search engines index a page, they add all the words on the page to their index, except, in Google's case, for very common words like "a," "an," and "the." When a user searches for one of the indexed terms, the search engine goes through its index of all pages containing those terms and selects the most relevant pages.
In the context of a search index, metadata is data that tells search engines what a web
page is about. Often meta titles and meta descriptions appear on search engine results pages
rather than user-visible web content.
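To make the idea of a search index concrete, here is a tiny hedged sketch of an inverted index in Python; the sample pages and the stop-word list are made up for illustration.

from collections import defaultdict

# Toy documents standing in for crawled pages
pages = {
    "page1": "web scraping extracts data from the web",
    "page2": "a crawler indexes the pages of a site",
}

STOP_WORDS = {"a", "an", "the", "of", "from"}  # illustrative stop words

# Build an inverted index: word -> set of pages containing it
index = defaultdict(set)
for page_id, text in pages.items():
    for word in text.lower().split():
        if word not in STOP_WORDS:
            index[word].add(page_id)

# A search simply looks the term up in the index
print(index["crawler"])  # {'page2'}
print(index["web"])      # {'page1'}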
Big Data
Big data refers to datasets that are too large or complex to be processed by traditional
data processing applications. Data with many entries (rows) provides greater statistical power, while data with higher complexity (more attributes or columns) may result in a higher false discovery rate.
Big data analytics challenges include data collection, data storage, data analysis,
search, sharing, transfer, visualization, query, update, privacy, and data provenance. Big data
was originally associated with three key concepts: volume, variety, and velocity.
Big data analysis poses sampling challenges, so until now only observation and sampling have been possible. A fourth concept, veracity, refers to the quality or insightfulness of the data. Without adequate investment in expertise for big data veracity, the volume and variety of data can create costs and risks that exceed an organization's ability to create and capture value from big data.
Data Extraction
Data extraction is the process of collecting or retrieving different types of data from
various sources, many of which may be poorly organized or completely unstructured. Data
extraction makes it possible to consolidate, process and optimize data so that it can be stored
in a central location for transformation. These locations can be on-premises, cloud-based, or a
mix of both.
Data extraction is the first step in the ETL (Extract, Transform, Load) and ELT
(Extract, Load, Transform) process. ETL/ELT itself is part of a complete data integration
strategy.
A web scraper is free to copy a piece of data in figure or table form from a web page
without any copyright infringement because it is difficult to prove a copyright over such data
since only a specific arrangement or a particular selection of the data is legally protected.
Regarding the ToS, although most web applications include some form of ToS agreement,
their enforceability usually lies within a gray area. For instance, the owner of a web scraper
that violates the ToS may argue that he or she never saw or officially agreed to the ToS.
Moreover, if a web scraper sends data-acquisition requests too frequently, this is functionally equivalent to a denial-of-service attack, in which case the web scraper's owner may be refused entry and may be liable for damages under the law of “trespass to chattels,” because the owner of the web application has a property interest in the physical web server which hosts the application. An ethical web scraping tool will avoid this issue by maintaining a reasonable requesting frequency. [7]
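A hedged sketch of maintaining a reasonable request frequency in Python; the delay value and URLs are illustrative choices, not a recommendation from the source.

import time
import urllib.request

urls = [
    "https://example.com/page1",  # placeholder pages
    "https://example.com/page2",
]

DELAY_SECONDS = 2.0  # illustrative politeness delay between requests

for url in urls:
    with urllib.request.urlopen(url) as resp:
        html = resp.read()
    print(url, len(html), "bytes")
    time.sleep(DELAY_SECONDS)  # throttle so the target server is not overloaded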