Web Scraping 101
2004 - Beautiful Soup is released, one of the first Python libraries used for web scraping
The primary goal of web scraping is to gather data from web pages, including
text, images, links, and other elements.
Automated Tools:
After fetching the HTML content of a page, the next step is to parse it to
identify and extract the desired information. This often involves using
libraries or frameworks that can navigate the HTML structure.
Storage:
Puppeteer
Puppeteer is a Node.js library developed by Google that provides a high-level API
to control headless Chrome or Chromium browsers. It's widely used for web
scraping because it drives a real browser, so pages that build their content
with JavaScript can still be scraped.
Puppeteer can run Chrome in headless mode, meaning it can perform web
scraping without opening a visible browser window. Skipping the visible UI
reduces overhead, which suits automated tasks.
3. JavaScript Rendering:
4. Easy Navigation:
5. Data Extraction:
You can easily extract data from the DOM using methods to query
elements, retrieve text content, and get attribute values.
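The query methods mentioned above can be sketched as follows. This assumes `page` is a Puppeteer Page that has already navigated somewhere; the helper name `extractHeadingAndLinks` and the selectors `h1` and `a` are placeholders chosen for illustration.

```javascript
// Reads the first <h1> and every link from an already-loaded page.
// `page` is expected to be a Puppeteer Page; the selectors are placeholders.
async function extractHeadingAndLinks(page) {
  // $eval runs the function on the first element matching the selector
  const heading = await page.$eval('h1', (el) => el.textContent.trim());

  // $$eval runs the function on an array of all matching elements
  const links = await page.$$eval('a', (els) =>
    els.map((el) => ({
      text: el.textContent.trim(),    // visible text of the link
      href: el.getAttribute('href'),  // attribute value as written in the HTML
    }))
  );

  return { heading, links };
}
```

Note that `$eval` rejects if no element matches the selector, so a real scraper would usually guard that call.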
Installation
Link for Puppeteer Library on NPM - https://www.npmjs.com/package/puppeteer
puppeteer-core does not download a browser, so when you install it you need to
specify an executable path for an existing Chrome or Chromium installation.
Windows - Chromium: C:\Users\<YourUsername>\AppData\Local\Chromium\Application\chrome.exe
macOS - Chromium: ~/Applications/Chromium.app/Contents/MacOS/Chromium
Linux - Chromium: /usr/bin/chromium-browser
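One possible way to pass these paths to puppeteer-core is sketched below. It assumes Chromium is installed at the locations listed above; `chromiumPath` and `launchChromium` are hypothetical helper names. Node does not expand `~`, so the sketch builds the per-user paths from `os.homedir()`.

```javascript
const os = require('os');
const path = require('path');

// Returns the Chromium location listed above for the given platform.
// These are assumed install locations; adjust them to match your machine.
function chromiumPath(platform = process.platform) {
  switch (platform) {
    case 'win32':
      return path.win32.join(
        os.homedir(), 'AppData', 'Local', 'Chromium', 'Application', 'chrome.exe'
      );
    case 'darwin':
      return path.posix.join(
        os.homedir(), 'Applications', 'Chromium.app', 'Contents', 'MacOS', 'Chromium'
      );
    default:
      return '/usr/bin/chromium-browser';
  }
}

// `puppeteerCore` is the module returned by require('puppeteer-core').
async function launchChromium(puppeteerCore) {
  // executablePath tells puppeteer-core which browser binary to start
  return puppeteerCore.launch({ executablePath: chromiumPath() });
}
```

Typical usage would be `const browser = await launchChromium(require('puppeteer-core'));` inside an async function.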
Page - The Page object represents a single tab or page in the browser. When
you create a new page using the newPage() method on a Browser instance, you
receive a Page instance. This object allows you to interact with the content of
the page, perform actions, and extract data.
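As a rough illustration of the Browser and Page relationship, the sketch below opens one tab, hands it to a caller-supplied function, and closes the tab afterwards. `withPage` and `task` are hypothetical names; `browser` is assumed to be the object returned by `puppeteer.launch()`.

```javascript
// Opens a new tab, navigates to `url`, runs `task(page)`, then closes the tab.
// `withPage` is a hypothetical helper; `browser` is a Puppeteer Browser.
async function withPage(browser, url, task) {
  const page = await browser.newPage(); // one Page instance corresponds to one tab
  try {
    // wait until the HTML has been parsed before handing the page to the task
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return await task(page);
  } finally {
    await page.close(); // close the tab even if the task throws
  }
}
```

For example, `await withPage(browser, 'https://example.com', (page) => page.title())` would resolve with the title of that page.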
Navigation:
Content Interaction:
You can perform actions like clicking buttons, filling out forms, and
navigating through links using methods such as click(selector) , type(selector,
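The interaction methods named above might be combined like this. The selectors `#search` and `button[type="submit"]` and the helper name `submitSearch` are assumptions made for the example, not taken from a real site.

```javascript
// Fills a search box and submits the form.
// `page` is a Puppeteer Page; both selectors are placeholders.
async function submitSearch(page, query) {
  // type() focuses the element and sends the text key by key
  await page.type('#search', query);

  // Start waiting for navigation before clicking, so the page load
  // triggered by the click is not missed.
  await Promise.all([
    page.waitForNavigation(),
    page.click('button[type="submit"]'),
  ]);
}
```

Waiting and clicking inside one `Promise.all` is a common pattern when a click causes a full page load.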
Data Extraction:
The Page object allows you to extract content from the DOM. You can
use evaluate() to run JavaScript in the context of the page and return
data.
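A short sketch of `evaluate()`, assuming `page` is a Puppeteer Page that has already loaded a document. The helper name `collectPageInfo` and its `selector` parameter are illustrative. The function passed to `evaluate()` runs inside the browser, so values from Node have to be passed in as extra arguments and the return value has to be serializable.

```javascript
// Runs a function inside the page and returns plain data to Node.
// `collectPageInfo` is a hypothetical helper; `page` is a Puppeteer Page.
async function collectPageInfo(page, selector) {
  return page.evaluate((sel) => {
    // This body executes in the browser: `document` exists here,
    // but variables from the surrounding Node code do not.
    const nodes = Array.from(document.querySelectorAll(sel));
    return {
      title: document.title,
      count: nodes.length,
      texts: nodes.map((node) => node.textContent.trim()),
    };
  }, selector); // `selector` is serialized and arrives in the page as `sel`
}
```

Calling `await collectPageInfo(page, 'h2')` would return the page title together with the text of every `h2` element.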
Event Handling:
Evaluate Method
Callbacks
Basic Callback
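As a small illustration, a callback is a function passed to another function, which calls it when it needs to. The names `mapItems` and `prices` below are made up for this sketch.

```javascript
// Applies `callback` to every item and collects the results.
function mapItems(items, callback) {
  const results = [];
  for (const item of items) {
    results.push(callback(item)); // the caller decides what happens to each item
  }
  return results;
}

// The arrow function is the callback: it turns a price string into a number.
const prices = mapItems(['$10', '$25'], (text) => Number(text.replace('$', '')));
console.log(prices); // [ 10, 25 ]
```

Puppeteer methods such as `$eval` and `evaluate` follow the same idea: you pass in a function and the library decides when and where to run it.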
Asynchronous Callbacks
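A rough sketch of an asynchronous callback, using a timer to stand in for slow work such as a network request. `loadLater` is a hypothetical function; it follows the Node.js convention of calling back with an error first and the result second, which is what lets `util.promisify` wrap it for use with `await`.

```javascript
const { promisify } = require('util');

// Simulates a slow operation and reports back through an error-first callback.
function loadLater(value, delayMs, callback) {
  setTimeout(() => {
    if (value === undefined) {
      callback(new Error('nothing to load'));
    } else {
      callback(null, value);
    }
  }, delayMs);
}

// Callback style: loadLater returns immediately and the callback runs later.
loadLater('page data', 10, (err, data) => {
  if (err) return console.error(err.message);
  console.log('callback got:', data);
});

// Promise style: the same function wrapped so it can be awaited.
const loadLaterAsync = promisify(loadLater);
```

Inside an async function, `const data = await loadLaterAsync('page data', 10);` reads like synchronous code, which is the style Puppeteer's own API uses.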