
Summer Internship Report

Internship Theme:
Contributing to the development of a website
data extractor based on scraping and crawling

Prepared By
Wissem Edine Triki

Supervised By
Mohamed Ben Sassi and Mhamed Chammem

Hosting company: F2P2N (Forum for the Promotion of National Digital Product)

Internship period: 01 July 2024 – 30 September 2024


Acknowledgements
First of all, I would like to express my sincere gratitude to my supervisors, Mr.
Mhamed Chammem and Mr. Mohamed Ben Sassi, who provided guidance and feedback
throughout this internship.
Furthermore, I would like to thank all members of “F2P2N” for the friendly
environment and their support. I am also thankful to all of the professors and office members
of ESPRIT for their considerate guidance and the quality of education.
Finally, I would also like to acknowledge everyone who helped throughout the
duration of this internship.

Summary
The goal of this internship with “F2P2N” is to create a web crawler that collects
data on Tunisian IT enterprises. This report presents the process of building this
crawler over the course of the internship.

In the first chapter we discuss the hosting company, the problem statement and the goal,
along with the requirements specification.

In the second chapter we discuss web scraping and its different aspects followed by
the reasoning behind the environment choices.

In the third chapter we discuss the planning process and the Gantt diagram that
shows the general tasks throughout the internship period.

In the fourth chapter we show the development process of the presented prototypes
and their results.

Finally, we conclude the report, bringing the internship to a close.

Abstract

Data extraction is the process of acquiring data from a specific source and moving it
into a new environment, which can be on-premises, cloud-based, or a hybrid of the two.

Various strategies are employed for this purpose, which can be complex and are
often performed manually. Unless data is being extracted for archival purposes only, this is
usually the first step of the extract, transform, load (ETL) process. This means that data is
almost always processed further after initial retrieval to make it available for later
analysis.

Table of contents
Acknowledgements 2

Summary 3

Abstract 4

Table of contents 5

List of Figures 7

General introduction 8

CHAPTER I: Preliminary Study 9


1.1 Introduction 9
1.2 Hosting Company 9
1.3 Problem Statement & Goal 9
1.4 Requirements specification 10
1.4.1 Functional Requirements: 10
1.4.2 Non-Functional Requirements: 10

CHAPTER II: State of the art 11


2.1 Introduction 11
2.2 Web Scraping 11
2.2.1 Definition 11
2.2.2 The role of web scrapers 11
2.2.3 Application of web scraping tools 11
2.2.4 Self-built and existing extraction tools 11
2.2.5 Python scraping libraries 12

CHAPTER III: Planning 13


3.1 Introduction 13
3.2 Tools 13
3.2.1 VSCODE 13
3.2.2 Github 13
3.2.3 Scrapy 13
3.2.4 Selenium 14
3.3 Gantt Diagram 15

CHAPTER IV: Development 16


4.1 Introduction 16
4.2 Scrapy Prototype 16
4.3 Selenium Prototype 17

Conclusion 19

Bibliography 20

Annexes 21
ANNEX 1 : “Tunisie Industrie” target data 21

ANNEX 2 : “RNE” target data 22
Appendix 23
Web crawling 23
Search Indexing 23

List of Figures

■ Figure 1: Python libraries.
■ Figure 2: Selenium logo.
■ Figure 3: Scrapy shell spider.
■ Figure 4: Scrapy shell response.
■ Figure 5: Scrapy shell table content.
■ Figure 6: Selenium table parse code sample.
■ Figure 7: Selenium data scraping result.
■ Figure 8: “Tunisie Industrie” target table.
■ Figure 9: “RNE” target table.

General introduction

In today's competitive world, everyone is looking for ways to innovate and take
advantage of new technologies. Web scraping (also known as web data extraction or data
scraping) provides a solution for those who wish to automate access to structured web data.
Web scraping is useful when the public website from which you want to retrieve data does
not have an API or has an API but provides limited access to the data.

Typically, web data extraction is used by individuals and companies that want to make
better-informed decisions based on the vast amount of publicly available web data.

Web data extraction, also known as data scraping, has a wide range of applications.
Data scrapers can help you automate the process of extracting information from other
websites quickly and accurately. They also ensure that the data you extract is organized for
analysis and use in other projects.

CHAPTER I: Preliminary Study

1.1 Introduction

Throughout this chapter, we will present the hosting company where this
internship was carried out, explain its areas of activity, and describe the problem
that was presented along with the solutions proposed in order to deliver a suitable
application.

1.2 Hosting Company


Over the last two decades, Information and Communication Technologies have represented,
worldwide, the most important field of technical, scientific and economic progress. These
technologies remain the most important indicator of the development and competitiveness of
developed and developing countries on the international scene.

The digital economy, being the economic and social lever par excellence, has become
a potential that is difficult to ignore, owing to the very high added value of its products and
its importance compared to other industries and sectors. The latter cannot develop without a
technological and communication infrastructure, a digital transformation strategy, and digital
products that help them sustain their offerings, stay ahead of the competition, and improve
their market position. This concerns production and service companies and distribution and
marketing channels, in addition to institutions, organizations, and public services.

In this context, the association “Forum for the Promotion of National Digital Product”
(F2P2N) was created as a scientific and cultural association focused on the promotion of
national digital products.

1.3 Problem Statement & Goal


While the prospect of extracting data may sound like a daunting task, it doesn’t have to be.
In fact, most companies and organizations now take advantage of data extraction tools to
manage the extraction process end to end. Using an ETL tool automates and simplifies
the extraction process so that resources can be deployed toward other priorities.

The benefits of using a data extraction tool include more control, increased agility,
simplified sharing, and improved accuracy and precision.

The aim of this internship is to contribute to the development of a data extraction
solution, based on Python web scraping tools, targeting Tunisian software firms in the
technology and IT domains.

The main objectives are to deepen our understanding of web crawling topics and web
crawler technology, create web crawling scripts (spiders) to search for and extract data
related to enterprise services, products and skills, and record the results for further
processing.

1.4 Requirements specification

1.4.1 Functional Requirements:


Regarding the functional requirements, we concluded that the initial prototypes
should be simple to use: the user launches the Python file to start the crawler and in turn
obtains a JSON file of the extracted data.
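
As a sketch of this requirement, a single Python file could embed both the spider and the JSON export, for example with Scrapy's CrawlerProcess and the FEEDS setting. The spider name, URL and selectors below are placeholders, not the actual prototype.

```python
# Hypothetical sketch: run a crawler from one Python file and export to JSON.
import scrapy
from scrapy.crawler import CrawlerProcess


class CompanySpider(scrapy.Spider):
    name = "companies"
    start_urls = ["https://example.com/companies"]  # placeholder URL

    def parse(self, response):
        # Placeholder selector: yield one item per table row.
        for row in response.css("table tr"):
            yield {"name": row.css("td::text").get()}


if __name__ == "__main__":
    # FEEDS tells Scrapy to write all yielded items to output.json.
    process = CrawlerProcess(settings={"FEEDS": {"output.json": {"format": "json"}}})
    process.crawl(CompanySpider)
    process.start()  # blocks until the crawl finishes
```

Launching this file with `python crawler.py` would then produce the JSON output described above.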

1.4.2 Non-Functional Requirements:


Regarding the non-functional requirements, we determined a series of criteria
that should be respected: the prototype should be fast and responsive, and the
code should be clean, commented, and easy to maintain.

CHAPTER II: State of the art

2.1 Introduction
Throughout this chapter we will explain web scraping, its process, and the variety of
Python libraries available for building web scraping tools.

2.2 Web Scraping

2.2.1 Definition
Web scraping is the extraction of data from a website. The collected information is then
exported in a format more useful to the user, such as a spreadsheet or an API. While web
scraping can be done manually, automated tools are preferred in most cases because they
are cheaper and work faster. Even so, web scraping is not always an easy task: websites come
in many forms, and the functions and features of web scrapers vary accordingly.

2.2.2 The role of web scrapers


The role of an automated web scraper is fairly simple, yet also complex; after all,
websites are built for people, not machines. First, the scraper is given one or more URLs to
load before crawling. It then loads all of the HTML code for the relevant pages; more
advanced scrapers render the entire website, including CSS and JavaScript elements. The
scraper then extracts either all of the data on the page or the specific data selected by the
user before running the project. Ideally, the user goes through a process of selecting the
specific data they want from the page. For example, you might want to scrape Amazon product
pages for prices and models without necessarily being interested in product reviews. Finally,
the web scraper outputs all of the collected data in a format that is more useful to the user.

2.2.3 Application of web scraping tools


Web scraping has a variety of uses, including:

● Monitoring e-commerce prices


● Finding opportunities for investment
● Analyzing social media web data
● Applying machine learning techniques
● Gathering web data automatically
● Researching new concepts in a field
● Extracting contact information
● Monitoring news sources
● Generating sales leads

2.2.4 Self-built and existing extraction tools


Just like how anyone can build a website, anyone can build their own web scraper.
However, the tools available to build your own web scraper still require some
advanced programming knowledge. The scope of this knowledge also increases with the
number of features you’d like your scraper to have.

On the other hand, there are numerous pre-built web scrapers that you can
download and run right away. Some of these also offer advanced options such as scrape
scheduling, JSON and Google Sheets exports, and more.

2.2.5 Python scraping libraries


Python offers a variety of libraries that one can use to scrape the web, such as
Scrapy, Beautiful Soup, Requests, Urllib, and Selenium. I am quite sure that more libraries
exist, and more will be released soon considering how popular Python is.

What we need, therefore, is an analysis of these libraries in order to make a choice
adapted to our project’s needs.

Figure 1: Python libraries.

CHAPTER III: Planning

3.1 Introduction

Throughout this chapter we will present the different tools needed to build the crawler
prototype and the process behind it.

3.2 Tools

3.2.1 VSCODE

Visual Studio Code (known as VS Code) is a free, open-source text editor from
Microsoft. VS Code is available for Windows, Linux, and macOS. Although
the editor is relatively lightweight, it includes some powerful features that make
VS Code one of the most popular development environment tools these days.

3.2.2 Github

GitHub, Inc., is an Internet hosting service for software development and
version control using Git. It provides the distributed version control of Git plus
access control, bug tracking, software feature requests, task management,
continuous integration, and wikis for every project.

3.2.3 Scrapy

Scrapy provides many powerful features for making scraping easy and efficient (a brief selector sketch follows this list), such as:

● Built-in support for selecting and extracting data from HTML/XML sources using
extended CSS selectors and XPath expressions, with helper methods to extract using
regular expressions.
● An interactive shell console (IPython aware) for trying out the CSS and XPath
expressions to scrape data, very useful when writing or debugging your
spiders.
● Built-in support for generating feed exports in multiple formats (JSON, CSV,
XML) and storing them in multiple backends (FTP, S3, local filesystem)
● Robust encoding support and auto-detection, for dealing with foreign, non-standard
and broken encoding declarations.
● Strong extensibility support, allowing you to plug in your own functionality
using signals and a well-defined API (middlewares, extensions, and pipelines).
● Wide range of built-in extensions and middlewares for handling:
○ cookies and session handling
○ HTTP features like compression, authentication, caching
○ user-agent spoofing
○ robots.txt
○ crawl depth restriction
○ and more
● A Telnet console for hooking into a Python console running inside your
Scrapy process, to introspect and debug your crawler
● Plus other goodies like reusable spiders to crawl sites from Sitemaps and
XML/CSV feeds, a media pipeline for automatically downloading images (or any
other media) associated with the scraped items, a caching DNS resolver, and much
more!
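
As a brief illustration of the selector features listed above, the sketch below parses a made-up HTML fragment with Scrapy's Selector class; the markup and field names are invented for the example.

```python
# Brief sketch of Scrapy's selector helpers; the HTML fragment is invented.
from scrapy.selector import Selector

html = """
<div id="company">
  <h1>Example SARL</h1>
  <p class="contact">Tel: +216 71 000 000</p>
</div>
"""

sel = Selector(text=html)

# Extended CSS selectors
name = sel.css("#company h1::text").get()

# XPath expressions
contact = sel.xpath('//p[@class="contact"]/text()').get()

# Helper method for extracting with a regular expression
phone = sel.css("p.contact::text").re_first(r"\+[\d ]+")

print(name, contact, phone)
```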

3.2.4 Selenium
Selenium is an open source umbrella project for a range of tools and libraries aimed at
supporting browser automation. It provides a playback tool for authoring functional tests
across most modern web browsers, without the need to learn a test scripting language.
● Open Source and Portable – Selenium is an open source and portable
Web testing Framework.
● Combination of tool and DSL – Selenium is a combination of tools and a
DSL (Domain Specific Language) for carrying out various types of
tests.
● Easier to understand and implement – Selenium commands are categorized
in terms of different classes which make it easier to understand and
implement.
● Less burden and stress for testers – As mentioned above, the amount of
time required to repeat test scenarios on each and every new build is
reduced to almost zero. Hence, the burden on testers is reduced.
● Cost reduction for business clients – The business pays the testers’
salaries, part of which is saved by using automated testing tools. Automation
not only saves time but also benefits the business.
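
As a small illustration of the browser automation described above, the following sketch opens a page and reads an element; the URL and locator are placeholders, and a Chrome driver is assumed to be installed.

```python
# Minimal illustrative sketch: the URL and the element locator are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes a matching Chrome driver is available
try:
    driver.get("https://example.com")
    # Selenium drives a real browser, so JavaScript-rendered content is visible.
    heading = driver.find_element(By.TAG_NAME, "h1")
    print(driver.title, "-", heading.text)
finally:
    driver.quit()
```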

Figure 2: Selenium logo.

3.3 Gantt Diagram
Here we present the Gantt diagram containing the main tasks and their time frame
during the internship period.

CHAPTER IV: Development
4.1 Introduction
Throughout this chapter we will present the crawler prototypes that were built and
the resulting scraped data.

In the Scrapy prototype we follow this procedure using a spider: fetch the target
website, check the validity of the response, and then filter the target data shown in
ANNEX 2.

In the Selenium prototype we access the target website, simulate the steps needed to
reach the target table shown in ANNEX 1, access the information in each row, and
extract the full data into a JSON file.

4.2 Scrapy Prototype

Scrapy enables web scraping through the creation of a spider containing a parser that
extracts data from HTML responses.
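
The figures below show the prototype running in the Scrapy shell. As an illustration only, here is a minimal sketch of what such a spider might look like; the start URL is the RNE site from the bibliography [1], while the table selector is purely hypothetical.

```python
# Not the actual prototype: a minimal sketch of the procedure described above.
import scrapy


class RnePrototypeSpider(scrapy.Spider):
    name = "rne_prototype"
    start_urls = ["https://www.registre-entreprises.tn/rne-public/"]  # see [1]

    def parse(self, response):
        # Check the validity of the response before filtering the target data.
        if response.status != 200:
            self.logger.warning("Unexpected status %s", response.status)
            return
        # Hypothetical selector for the rows of the target table.
        for row in response.css("table.results tr"):
            yield {"cells": row.css("td::text").getall()}
```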

Figure 3: Scrapy shell spider.


In Figure 3 we see the spider accessing the RNE [1] website and confirming the crawled
state.

Figure 4: Scrapy shell response.
In Figure 4 we observe the spider in action through the HTML responses obtained.

Figure 5: Scrapy shell table content.

Due to the heavy JavaScript in the source websites, Scrapy had issues accurately
scraping the target data and navigating the data sources, as seen in Figure 5.
That is why we decided to move away from Scrapy in favor of Selenium, which is
better suited to navigating JavaScript-heavy websites and accessing all target data more
efficiently.

4.3 Selenium Prototype

In the Selenium prototype we managed to navigate the JavaScript-heavy websites
and reach the paginated and href-linked data sources.

Figure 6: Selenium table parse code sample.

The code sample in Figure 6 shows the process of scraping “Tunisie Industrie”
and how we step through the different elements.

Figure 7: Selenium data scraping result.

Since JSON is a fast and convenient way to store the data, Figure 7 shows a sample of the
final data produced by the Selenium prototype, stored in a JSON file containing the
website’s name and the timestamp.
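
To make the described flow more concrete, here is a rough sketch rather than the prototype of Figure 6: the target URL, the element locators, the “Next” pagination link and the output layout are all assumptions.

```python
# Rough sketch only: URL, locators, pagination link and output layout are assumptions.
import json
from datetime import datetime

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
rows = []
try:
    driver.get("https://example.com/companies")  # placeholder target site

    while True:
        # Read every row of the table on the current result page.
        for tr in driver.find_elements(By.CSS_SELECTOR, "table tbody tr"):
            cells = tr.find_elements(By.TAG_NAME, "td")
            rows.append([cell.text for cell in cells])

        # Follow the pagination until no "Next" link is left.
        try:
            driver.find_element(By.LINK_TEXT, "Next").click()
        except NoSuchElementException:
            break
finally:
    driver.quit()

# Store the result together with the website's name and a timestamp.
result = {
    "website": "example.com",
    "timestamp": datetime.now().isoformat(),
    "rows": rows,
}
with open("scraping_result.json", "w", encoding="utf-8") as f:
    json.dump(result, f, ensure_ascii=False, indent=2)
```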

Conclusion

With the rapid development of the Internet, businesses are becoming more and more
dependent on data, and owning data on all aspects of the business is now a necessity.

The benefits of web scraping and its processes have become an important aspect of all
decision-making processes for companies of all sizes. It is clear that web scraping software
tools are moving forward and will provide users with a competitive advantage.

In this report we discussed what web scraping is and carried out the internship by using
Python scraping libraries, first Scrapy and then Selenium, to create a web crawler whose
role is to collect data related to Tunisian IT enterprises and record the results for
further processing.

Bibliography

● [1] National Enterprise Register : https://www.registre-entreprises.tn/rne-public/


● [2] E-RNE : https://e-rne.tn/
● [3] Scrapy documentation : https://docs.scrapy.org/en/latest/
● [4] Selenium documentation : https://www.selenium.dev/documentation/
● [5] Python Selenium : https://ledatascientist.com/web-scraping-python-avec-selenium/
● [6] Python web scraping repo : https://github.com/topics/python-web-scraper
● [7] Bo Zhao - University of Washington Seattle - Web Scraping
https://www.researchgate.net/publication/317177787_Web_Scraping

Annexes
ANNEX 1: “Tunisie Industrie” target data

Figure 8: “Tunisie Industrie” target table.

ANNEX 2: “RNE” target data [1][2]

Figure 9: “RNE” target table.

Appendix

Web crawling

A web crawler, sometimes called a spider or spider bot and often shortened to crawler,
is an Internet bot that systematically browses the World Wide Web, typically operated
by search engines for the purpose of web indexing (web spidering).

Web search engines and some other websites use web crawling or spidering software to
update their own web content or their indexes of other sites’ content. Web crawlers copy
pages for processing by a search engine, which indexes the downloaded pages so that users
can search more efficiently.

Crawlers consume resources on the systems they visit and often visit sites without approval.
Issues of scheduling, load, and "politeness" come into play when large numbers of pages are
visited. Public sites that do not wish to be crawled have mechanisms for letting crawlers
know: for example, a robots.txt file can request that bots index only parts of a website, or
nothing at all.
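
For illustration, a polite crawler written in Python could consult robots.txt with the standard library before fetching a page; the URLs and user-agent name below are placeholders.

```python
# Small sketch of how a polite crawler might consult robots.txt before fetching.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()

url = "https://example.com/private/page.html"
if rp.can_fetch("MyCrawler/1.0", url):
    print("Allowed to fetch", url)
else:
    print("robots.txt disallows fetching", url)
```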

Search Indexing

A search index is like creating a library card catalog for the web, so search engines
know where on the web to pull information when people search for it. It can also be likened
to an index at the end of a book, which lists all the places in the book where a particular topic
or phrase is mentioned.

The index focuses on the text displayed on the page and on the page metadata that is not
visible to the user. When most search engines index a page, they add all of the words on the
page to their index, except, in Google’s case, common words like "a," "an," and "the." When a
user searches for these terms, the search engine goes through its index of all pages
containing them and selects the most relevant pages.

In the context of a search index, metadata is data that tells search engines what a web
page is about. Often meta titles and meta descriptions appear on search engine results pages
rather than user-visible web content.

Big Data

Big data refers to datasets that are too large or complex to be processed by traditional
data processing applications. Data with many entries (rows) provides greater statistical
power, while data with higher complexity (more attributes or columns) may lead to a higher
false discovery rate.

Big data analytics challenges include data collection, data storage, data analysis,
search, sharing, transfer, visualization, query, update, privacy, and data provenance. Big data
was originally associated with three key concepts: volume, variety, and velocity.

Big data analysis poses sampling challenges, which previously allowed only observations and
sampling. A fourth concept, veracity, refers to the quality or insightfulness of the data.
Without adequate investment in expertise for big data veracity, the volume and variety of
data can create costs and risks that exceed an organization’s ability to create and capture
value from big data.

Data Extraction

Data extraction is the process of collecting or retrieving different types of data from
various sources, many of which may be poorly organized or completely unstructured. Data
extraction makes it possible to consolidate, process and optimize data so that it can be stored
in a central location for transformation. These locations can be on-premises, cloud-based, or a
mix of both.

Data extraction is the first step in the ETL (Extract, Transform, Load) and ELT
(Extract, Load, Transform) process. ETL/ELT itself is part of a complete data integration
strategy.

Web scraping legality


Although web scraping is a powerful technique in collecting large data sets, it is
controversial and may raise legal questions related to copyright (O’Reilly 2006), terms of
service (ToS) (Fisher et al. 2010), and “trespass to chattels” (Hirschey 2014).

A web scraper is free to copy a piece of data in figure or table form from a web page
without any copyright infringement because it is difficult to prove a copyright over such data
since only a specific arrangement or a particular selection of the data is legally protected.
Regarding the ToS, although most web applications include some form of ToS agreement,
their enforceability usually lies within a gray area. For instance, the owner of a web scraper
that violates the ToS may argue that he or she never saw or officially agreed to the ToS.
Moreover, if a web scraper sends data acquisition requests too frequently, this is functionally
equivalent to a denial-of-service attack, in which case the web scraper’s owner may be refused
entry and may be liable for damages under the law of “trespass to chattels,” because the owner
of the web application has a property interest in the physical web server that hosts the
application. An ethical web scraping tool will avoid this issue by maintaining a reasonable
request frequency. [7]
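
As a simple illustration, a scraper can space out its requests; the delay value and URLs below are arbitrary, and frameworks such as Scrapy expose an equivalent DOWNLOAD_DELAY setting.

```python
# Illustrative sketch of keeping a reasonable request frequency.
import time
import requests

urls = ["https://example.com/page/1", "https://example.com/page/2"]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(2)  # pause between requests instead of hammering the server
```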

