
Scraping a Table on https site using R

Last Updated : 24 Apr, 2025

In this article, we discuss the basics of web scraping in R using the rvest and tidyverse libraries. We show how to extract tables from a website and how to manipulate the resulting data. The examples should provide a good starting point for anyone looking to scrape tables from websites. The following key concepts relate to scraping tables in R:

  1. Web scraping with R: R provides various libraries such as rvest and XML that can be used to extract data from websites.
  2. Reading HTML: R can read HTML pages, and these pages can be parsed to extract the data we are interested in.
  3. Selectors: To extract data from a website, we need to know the HTML structure of the page. Selectors in R allow us to select elements from the HTML page using CSS selectors or XPath.
  4. Parsing HTML: After selecting the elements of interest, the next step is to parse the HTML content and extract the data.
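The four steps above can be sketched offline with rvest's minimal_html helper, which turns an inline HTML string into a parseable document. This is a minimal illustration (the table contents are invented), so no network access is needed:

```r
library(rvest)

# A small inline HTML document stands in for a real web page
html <- minimal_html('
  <table>
    <tr><th>Country</th><th>Capital</th></tr>
    <tr><td>France</td><td>Paris</td></tr>
    <tr><td>Japan</td><td>Tokyo</td></tr>
  </table>')

# Select the table element with a CSS selector ...
node <- html_nodes(html, "table")

# ... then parse it into a data frame
df <- html_table(node)[[1]]
print(df)
```

The same select-then-parse pattern applies unchanged when the document comes from read_html on a live URL.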

Before we start scraping tables, two prerequisites must be met: R must be installed on the system, and the rvest library must be installed in R. If rvest is not installed, it can be installed by running the following command in the R console:

install.packages("rvest")
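Since the second example below also uses the tidyverse package, it is convenient to install both up front. A small sketch that installs each package only if it is missing:

```r
# Install rvest and tidyverse only if they are not already present
for (pkg in c("rvest", "tidyverse")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}
```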

Scraping a Table from a Static Website

In this example, we use the read_html function to read the HTML content of the website. We then use the html_nodes function to select all table nodes with a CSS selector, parse them with the html_table function, and keep the second table. Finally, we print the first six rows of the table with head.

R
library(rvest)

# Build the page URL (split across two lines with paste0,
# since R strings do not support backslash line continuation)
url <- paste0("https://en.wikipedia.org/wiki/",
              "List_of_countries_by_GDP_(PPP)_per_capita")

# Read the HTML content of the website
webpage <- read_html(url)

# Select all table nodes using a CSS selector
table_node <- html_nodes(webpage, "table")

# Parse the tables and keep the second one
table_content <- html_table(table_node)[[2]]

# Print the table
head(table_content)

Output:

(The first six rows of the GDP (PPP) per capita table.)
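Indexing into the list of parsed tables by position, as with [[2]] above, breaks silently if the page layout changes. Where the target table carries a distinguishing class (Wikipedia marks its data tables with class="wikitable"), selecting by that class is more robust. A minimal offline sketch, using an inline HTML stand-in for the real page:

```r
library(rvest)

# Inline HTML standing in for the page: a sidebar table
# plus the data table marked class="wikitable"
html <- minimal_html('
  <table class="infobox"><tr><td>sidebar</td></tr></table>
  <table class="wikitable">
    <tr><th>Rank</th><th>Country</th></tr>
    <tr><td>1</td><td>Luxembourg</td></tr>
  </table>')

# The class selector skips the infobox entirely
df <- html_table(html_nodes(html, "table.wikitable"))[[1]]
print(df)
```

The same selector ("table.wikitable") can be passed to html_nodes on the page read with read_html above.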
Scraping a Table from a Dynamic Website

In this example, we scrape a table from the Worldometers population page. One caveat: rvest only sees the HTML that the server returns and cannot execute JavaScript, so tables rendered purely client-side cannot be scraped this way; this example works because the table is present in the page's initial HTML. The rvest library is used to read the HTML code of the webpage, the html_nodes function selects the first table on the page, and the html_table function converts it into a data frame. Finally, the first few rows of the data frame are displayed using the head function.

R
library(rvest)
library(tidyverse)

# URL of the website (split across two lines with paste0,
# since R strings do not support backslash line continuation)
url <- paste0("https://www.worldometers.info/world-population/",
              "population-by-country/")

# Read the HTML code of the page
html_code <- read_html(url)

# Use the html_nodes function to extract the table
table_html <- html_code %>% html_nodes("table") %>% .[[1]]

# Use the html_table function to convert the table 
# HTML code into a data frame
table_df <- table_html %>% html_table()

# Inspect the first few rows of the data frame
head(table_df)

Output:

(The first few rows of the population-by-country table.)

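Once a table is in a data frame, it can be manipulated with tidyverse verbs. Scraped tables often arrive as character columns with thousands separators, so a typical first step is cleaning and sorting. A sketch using hypothetical values mirroring the shape of the population table:

```r
library(dplyr)

# Hypothetical scraped rows: numbers arrive as character
# strings with thousands separators
raw <- tibble(
  Country = c("China", "India", "United States"),
  Population = c("1,425,671,352", "1,428,627,663", "339,996,563")
)

# Strip the separators, convert to numeric, and sort descending
clean <- raw %>%
  mutate(Population = as.numeric(gsub(",", "", Population))) %>%
  arrange(desc(Population))

head(clean)
```

The same mutate/arrange pipeline applies directly to table_df from the example above, with the column names adjusted to match the scraped headers.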