
Web Scraping Fundamentals

What is Web Scraping?


Web scraping is the process of automatically extracting data from websites. It
involves fetching the content of a web page and parsing it to collect specific
information.

1993: The World Wide Web Wanderer, an early bot that crawled the web to index website links.

2004: Beautiful Soup, one of the first Python libraries for web scraping, is released.

Common Use Cases


Data Mining: Collecting data for analysis, research, or machine learning.

Price Monitoring: Tracking prices and availability of products across different e-commerce sites.

Market Research: Gathering insights about competitors, trends, and customer opinions from forums and reviews.

Content Aggregation: Compiling information from multiple sources into a single platform, such as news articles or job listings.

Steps to Scrape a Web Page


Data Extraction:

The primary goal is gathering data from web pages, including text, images, links, and other elements.

Automated Tools:

Web scraping is typically performed using automated tools or scripts, which can navigate websites, simulate user behavior, and extract data without manual intervention.

Parsing:

After fetching the HTML content of a page, the next step is to parse it to identify and extract the desired information. This often involves using libraries or frameworks that can navigate the HTML structure.

Storage:

Once the data is extracted, it can be stored in various formats, such as CSV files, databases, or spreadsheets, for further analysis or processing.
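The parse-and-store steps can be sketched in plain Node.js. The HTML snippet, field names, and regex below are made up for illustration; a real scraper would fetch live pages and use a proper HTML parser rather than a regular expression.

```javascript
// Sketch of the parsing and storage steps on a hardcoded HTML snippet.
// In a real scraper the HTML would come from an HTTP request, and a
// proper parser would replace the regex below.
const html = `
  <ul>
    <li><span class="name">Mouse</span><span class="price">19.99</span></li>
    <li><span class="name">Keyboard</span><span class="price">49.99</span></li>
  </ul>`;

// Parsing: pull out the fields we care about.
function parseProducts(html) {
  const products = [];
  const re = /<span class="name">(.*?)<\/span><span class="price">(.*?)<\/span>/g;
  let m;
  while ((m = re.exec(html)) !== null) {
    products.push({ name: m[1], price: parseFloat(m[2]) });
  }
  return products;
}

// Storage: serialize the extracted records as CSV.
function toCsv(products) {
  return ['name,price', ...products.map(p => `${p.name},${p.price}`)].join('\n');
}

console.log(toCsv(parseProducts(html)));
```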

Legal and Ethical Considerations


Respect robots.txt: Many websites have a robots.txt file that specifies rules about what can be scraped. Always check and comply with these rules.

Terms of Service: Scraping may violate a website's terms of service. Be sure to review and adhere to them.

Rate Limiting: To avoid overloading a server, it's important to implement rate limiting and avoid making too many requests in a short period.
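A simple way to respect rate limits is to pause between requests instead of firing them all at once. A minimal sketch; the delay value is a placeholder, and fetchFn stands in for whatever performs the real request:

```javascript
// Sleep helper: resolve after ms milliseconds.
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Visit URLs one at a time with a pause between requests. fetchFn is
// whatever does the real request (e.g. fetch, or a Puppeteer page
// wrapper); the 1-second default delay is an arbitrary placeholder
// to tune per site.
async function politeFetchAll(urls, fetchFn, delayMs = 1000) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchFn(url));
    await sleep(delayMs);
  }
  return results;
}
```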

Puppeteer
Puppeteer is a Node.js library developed by Google that provides a high-level API to control headless Chrome or Chromium browsers. It is widely used for web scraping because it drives a real browser and can therefore handle pages that render their content with JavaScript.

Key Features of Puppeteer for Web Scraping


1. Headless Browsing:

Puppeteer can run Chrome in headless mode, meaning it can perform web
scraping without opening a visible browser window. This makes it more
efficient and faster for automated tasks.

2. Full Browser Control:

Puppeteer allows you to control nearly all aspects of the browser, including navigation, clicking elements, filling forms, and taking screenshots. This makes it suitable for scraping complex web applications.

3. JavaScript Rendering:

Many modern websites rely on JavaScript to render content. Puppeteer can execute JavaScript on pages, which allows you to scrape dynamic content that might not be available in the initial HTML.

4. Easy Navigation:

Puppeteer provides straightforward methods for navigating to pages, waiting for elements to load, and handling timeouts, which simplifies the scraping process.

5. Data Extraction:

You can easily extract data from the DOM using methods to query
elements, retrieve text content, and get attribute values.

6. Screenshots and PDFs:

Puppeteer can take screenshots of pages or generate PDFs, which can be useful for visual verification of scraped content.
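The features above come together in a short script. This is a sketch that assumes puppeteer is installed; the URL is a placeholder:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Headless browsing: no visible browser window is opened.
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Navigate and wait for the page (including JS-rendered content) to settle.
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // Data extraction: run code in the page context and return the result.
  const title = await page.evaluate(() => document.title);
  console.log(title);

  // Screenshot for visual verification of what was scraped.
  await page.screenshot({ path: 'page.png' });

  await browser.close();
})();
```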

Installation
Link for Puppeteer Library on NPM - https://www.npmjs.com/package/puppeteer

npm i puppeteer # Downloads compatible Chrome during installation.

npm i puppeteer-core # Alternatively, install as a library, without downloading Chrome.

When you install puppeteer-core, you need to specify an executable path for
Chrome or Chromium.

Windows: Typically located at:

Chrome: C:\Program Files\Google\Chrome\Application\chrome.exe

Chromium: C:\Users\<YourUsername>\AppData\Local\Chromium\Application\chrome.exe

macOS: Typically located at:

Chrome: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome

Chromium: ~/Applications/Chromium.app/Contents/MacOS/Chromium

Linux: Usually installed via package managers, often at:



Chrome: /usr/bin/google-chrome

Chromium: /usr/bin/chromium-browser
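With puppeteer-core, the path is passed through the executablePath launch option. A sketch using the Linux Chrome path listed above; adjust for your OS:

```javascript
const puppeteer = require('puppeteer-core');

(async () => {
  const browser = await puppeteer.launch({
    // Point puppeteer-core at an existing Chrome/Chromium install;
    // use the path for your platform (see the locations above).
    executablePath: '/usr/bin/google-chrome',
    headless: true,
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();
```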

Classes inside Puppeteer Library


Browser - This instance represents a browser session and allows you to
perform various operations, such as opening new pages, closing the browser,
and managing browser contexts.

Page - The Page object represents a single tab or page in the browser. When
you create a new page using the newPage() method on a Browser instance, you
receive a Page instance. This object allows you to interact with the content of
the page, perform actions, and extract data.

Navigation:

Methods like goto(URL) enable you to navigate to a specific URL.

Content Interaction:

You can perform actions like clicking buttons, filling out forms, and navigating through links using methods such as click(selector), type(selector, text), and evaluate().

Data Extraction:

The Page object allows you to extract content from the DOM. You can use evaluate() to run JavaScript in the context of the page and return data.

Event Handling:

You can listen to various events on the page, such as load, domcontentloaded, and more.

Screenshots and PDFs:

You can take screenshots of the page or generate PDFs using the screenshot() and pdf() methods.
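A sketch tying these Page methods together; the URL and the '#query', '#submit', and '.result' selectors are hypothetical:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigation.
  await page.goto('https://example.com/search');

  // Content interaction: fill a form field, click a button, wait for results.
  await page.type('#query', 'web scraping');
  await page.click('#submit');
  await page.waitForSelector('.result');

  // Data extraction via evaluate().
  const results = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.result'), el => el.textContent)
  );
  console.log(results);

  // Screenshots and PDFs.
  await page.screenshot({ path: 'results.png' });
  await page.pdf({ path: 'results.pdf' });

  await browser.close();
})();
```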

Evaluate Method



The evaluate method of the Page object in Puppeteer takes a function as an argument and executes it in the context of the page, i.e. inside the browser rather than in Node.js. The function therefore has access to the page's DOM, and its return value (which must be serializable) is passed back to your script.
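Extra arguments to evaluate are serialized and forwarded to the page function. A fragment assuming a page obtained as in the earlier sketches; the selector is hypothetical:

```javascript
// Runs inside the browser: count elements matching a selector and
// return the number back to Node.js.
const count = await page.evaluate(
  selector => document.querySelectorAll(selector).length,
  'a.result' // hypothetical selector, serialized into the page context
);
console.log(count);
```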

Callbacks

Basic Callback

Asynchronous Callbacks

Array Method Callbacks
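These three callback styles can be illustrated in plain JavaScript; the names and values here are made up for illustration:

```javascript
// Basic callback: a function passed to another function and invoked by it.
function greet(name, callback) {
  callback(`Hello, ${name}!`);
}
let basicResult;
greet('scraper', msg => { basicResult = msg; });

// Asynchronous callback: invoked later, after an async operation completes.
function fetchDataLater(callback) {
  setTimeout(() => callback('page content'), 10);
}
fetchDataLater(data => console.log(data));

// Array method callbacks: map and filter call a callback per element.
const prices = [20, 50, 5];
const doubled = prices.map(p => p * 2);
const cheap = prices.filter(p => p < 20);
console.log(basicResult, doubled, cheap);
```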
