
Web Scraping Fundamentals

What is Web Scraping?


Web scraping is the process of automatically extracting data from websites. It
involves fetching the content of a web page and parsing it to collect specific
information.

1993: The World Wide Web Wanderer, an early bot that crawled the web to index website links.

2004: Beautiful Soup, one of the first Python libraries for web scraping, is released.

Common Use Cases


Data Mining: Collecting data for analysis, research, or machine learning.

Price Monitoring: Tracking prices and availability of products across different e-commerce sites.

Market Research: Gathering insights about competitors, trends, and customer opinions from forums and reviews.

Content Aggregation: Compiling information from multiple sources into a single platform, such as news articles or job listings.

Steps to Scrape a Web Page


Data Extraction:

The primary goal is gathering data from web pages, including text, images, links, and other elements.

Automated Tools:

Web scraping is typically performed using automated tools or scripts, which can navigate websites, simulate user behavior, and extract data without manual intervention.

Parsing:

After fetching the HTML content of a page, the next step is to parse it to identify and extract the desired information. This often involves using libraries or frameworks that can navigate the HTML structure.

Storage:

Once the data is extracted, it can be stored in various formats, such as CSV files, databases, or spreadsheets, for further analysis or processing.
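The parse-and-store steps can be sketched in plain Node.js. The HTML snippet, field names, and regex below are made up for illustration; a real scraper would fetch live pages and use a proper HTML parser rather than a regular expression.

```javascript
// Sketch of the parsing and storage steps on a hardcoded HTML snippet.
// In a real scraper the HTML would come from an HTTP request, and a
// proper parser would replace the regex below.
const html = `
  <ul>
    <li><span class="name">Mouse</span><span class="price">19.99</span></li>
    <li><span class="name">Keyboard</span><span class="price">49.99</span></li>
  </ul>`;

// Parsing: pull out the fields we care about.
function parseProducts(html) {
  const products = [];
  const re = /<span class="name">(.*?)<\/span><span class="price">(.*?)<\/span>/g;
  let m;
  while ((m = re.exec(html)) !== null) {
    products.push({ name: m[1], price: parseFloat(m[2]) });
  }
  return products;
}

// Storage: serialize the extracted records as CSV.
function toCsv(products) {
  return ['name,price', ...products.map(p => `${p.name},${p.price}`)].join('\n');
}

console.log(toCsv(parseProducts(html)));
```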

Legal and Ethical Considerations


Respect robots.txt: Many websites have a robots.txt file that specifies rules about what can be scraped. Always check and comply with these rules.

Terms of Service: Scraping may violate a website's terms of service. Be sure to review and adhere to them.

Rate Limiting: To avoid overloading a server, it's important to implement rate limiting and avoid making too many requests in a short period.
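A simple way to respect rate limits is to pause between requests instead of firing them all at once. A minimal sketch; the delay value is a placeholder, and fetchFn stands in for whatever performs the real request:

```javascript
// Sleep helper: resolve after ms milliseconds.
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Visit URLs one at a time with a pause between requests. fetchFn is
// whatever does the real request (e.g. fetch, or a Puppeteer page
// wrapper); the 1-second default delay is an arbitrary placeholder
// to tune per site.
async function politeFetchAll(urls, fetchFn, delayMs = 1000) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchFn(url));
    await sleep(delayMs);
  }
  return results;
}
```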

Puppeteer
Puppeteer is a Node.js library developed by Google that provides a high-level API to control headless Chrome or Chromium browsers. It is widely used for web scraping because it drives a real browser and can therefore handle pages that render their content with JavaScript.

Key Features of Puppeteer for Web Scraping


1. Headless Browsing:

Puppeteer can run Chrome in headless mode, meaning it can perform web
scraping without opening a visible browser window. This makes it more
efficient and faster for automated tasks.

2. Full Browser Control:

Puppeteer allows you to control nearly all aspects of the browser, including navigation, clicking elements, filling forms, and taking screenshots. This makes it suitable for scraping complex web applications.

3. JavaScript Rendering:

Many modern websites rely on JavaScript to render content. Puppeteer can execute JavaScript on pages, which allows you to scrape dynamic content that might not be available in the initial HTML.

4. Easy Navigation:

Puppeteer provides straightforward methods for navigating to pages, waiting for elements to load, and handling timeouts, which simplifies the scraping process.

5. Data Extraction:

You can easily extract data from the DOM using methods to query
elements, retrieve text content, and get attribute values.

6. Screenshots and PDFs:

Puppeteer can take screenshots of pages or generate PDFs, which can be useful for visual verification of scraped content.
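The features above come together in a short script. This is a sketch that assumes puppeteer is installed; the URL is a placeholder:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Headless browsing: no visible browser window is opened.
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Navigate and wait for the page (including JS-rendered content) to settle.
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // Data extraction: run code in the page context and return the result.
  const title = await page.evaluate(() => document.title);
  console.log(title);

  // Screenshot for visual verification of what was scraped.
  await page.screenshot({ path: 'page.png' });

  await browser.close();
})();
```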

Installation
Link for Puppeteer Library on NPM - https://www.npmjs.com/package/puppeteer

npm i puppeteer # Downloads compatible Chrome during installation.

npm i puppeteer-core # Alternatively, install as a library, without downloading Chrome.

When you install puppeteer-core, you need to specify an executable path for
Chrome or Chromium.

Windows: Typically located at:

Chrome: C:\Program Files\Google\Chrome\Application\chrome.exe

Chromium: C:\Users\<YourUsername>\AppData\Local\Chromium\Application\chrome.exe

macOS: Typically located at:

Chrome: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome

Chromium: ~/Applications/Chromium.app/Contents/MacOS/Chromium

Linux: Usually installed via package managers, often at:



Chrome: /usr/bin/google-chrome

Chromium: /usr/bin/chromium-browser
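With puppeteer-core, the path is passed through the executablePath launch option. A sketch using the Linux Chrome path listed above; adjust for your OS:

```javascript
const puppeteer = require('puppeteer-core');

(async () => {
  const browser = await puppeteer.launch({
    // Point puppeteer-core at an existing Chrome/Chromium install;
    // use the path for your platform (see the locations above).
    executablePath: '/usr/bin/google-chrome',
    headless: true,
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();
```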

Classes inside Puppeteer Library


Browser - This instance represents a browser session and allows you to
perform various operations, such as opening new pages, closing the browser,
and managing browser contexts.

Page - The Page object represents a single tab or page in the browser. When
you create a new page using the newPage() method on a Browser instance, you
receive a Page instance. This object allows you to interact with the content of
the page, perform actions, and extract data.

Navigation:

Methods like goto(URL) enable you to navigate to a specific URL.

Content Interaction:

You can perform actions like clicking buttons, filling out forms, and navigating through links using methods such as click(selector), type(selector, text), and evaluate().

Data Extraction:

The Page object allows you to extract content from the DOM. You can use evaluate() to run JavaScript in the context of the page and return data.

Event Handling:

You can listen to various events on the page, such as load, domcontentloaded, and more.

Screenshots and PDFs:

You can take screenshots of the page or generate PDFs using the screenshot() and pdf() methods.
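A sketch tying these Page methods together; the URL and the '#query', '#submit', and '.result' selectors are hypothetical:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigation.
  await page.goto('https://example.com/search');

  // Content interaction: fill a form field, click a button, wait for results.
  await page.type('#query', 'web scraping');
  await page.click('#submit');
  await page.waitForSelector('.result');

  // Data extraction via evaluate().
  const results = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.result'), el => el.textContent)
  );
  console.log(results);

  // Screenshots and PDFs.
  await page.screenshot({ path: 'results.png' });
  await page.pdf({ path: 'results.pdf' });

  await browser.close();
})();
```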

Evaluate Method



The evaluate method of the Page object in Puppeteer takes a function as an argument and executes it in the context of the page, i.e. inside the browser rather than in Node.js. The function therefore has access to the page's DOM, and its return value (which must be serializable) is passed back to your script.
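Extra arguments to evaluate are serialized and forwarded to the page function. A fragment assuming a page obtained as in the earlier sketches; the selector is hypothetical:

```javascript
// Runs inside the browser: count elements matching a selector and
// return the number back to Node.js.
const count = await page.evaluate(
  selector => document.querySelectorAll(selector).length,
  'a.result' // hypothetical selector, serialized into the page context
);
console.log(count);
```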

Callbacks

Basic Callback

Asynchronous Callbacks

Array Method Callbacks
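These three callback styles can be illustrated in plain JavaScript; the names and values here are made up for illustration:

```javascript
// Basic callback: a function passed to another function and invoked by it.
function greet(name, callback) {
  callback(`Hello, ${name}!`);
}
let basicResult;
greet('scraper', msg => { basicResult = msg; });

// Asynchronous callback: invoked later, after an async operation completes.
function fetchDataLater(callback) {
  setTimeout(() => callback('page content'), 10);
}
fetchDataLater(data => console.log(data));

// Array method callbacks: map and filter call a callback per element.
const prices = [20, 50, 5];
const doubled = prices.map(p => p * 2);
const cheap = prices.filter(p => p < 20);
console.log(basicResult, doubled, cheap);
```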
