Scrape an HTML Table Using rvest in R

Last Updated : 29 Aug, 2024

Web scraping is a technique for extracting data from websites. In R, the rvest package is a popular tool for the job. It is easy to use and works well with pages whose content is present in the HTML source. This article walks through the process of scraping an HTML table using rvest.

What is rvest?

rvest is an R package that simplifies the process of web scraping. It allows us to easily extract data from web pages by converting HTML content into R data frames, which are easy to manipulate and analyze. rvest works seamlessly with other R packages, making it an essential tool for data collection from web pages.

What are HTML Tables?

HTML tables are used to display data in rows and columns on web pages. Each table has a <table> tag, and within it, rows are defined with <tr> tags. Each row contains cells, which are defined with <td> tags (for regular cells) or <th> tags (for header cells). Understanding the structure of HTML tables is crucial for correctly extracting data using web scraping techniques.
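
To make the tag structure concrete, here is a minimal sketch that parses a small hand-written table. The HTML string and the variable names are illustrative assumptions, not part of the Wikipedia page used later in this article.

```r
library(rvest)

# A small hand-written table: one header row (<th>) and two data rows (<td>)
html <- '<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>Germany</td><td>Berlin</td></tr>
  <tr><td>Japan</td><td>Tokyo</td></tr>
</table>'

# read_html() also accepts a literal HTML string
page <- read_html(html)

# Select the <table> node and convert it to a data frame
tbl <- html_table(html_node(page, "table"))
print(tbl)
```

Because the first row uses <th> cells, html_table() treats it as the header, so the result has two data rows and the columns Country and Capital.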

Now we will go step by step through using rvest to scrape an HTML table in the R programming language.

Step 1: Install and Load the Required Packages

First, ensure that you have the necessary packages installed and loaded.

R
# Install rvest and xml2 if you haven't already
install.packages("rvest")
install.packages("xml2")

# Load the packages
library(rvest)
library(xml2)

Step 2: Load the Web Page

Identify the URL of the web page from which you want to scrape the table.

R
# Specify the URL of the Wikipedia page
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

Step 3: Read the Web Page Content

Use the read_html() function to load the web page into R.

R
page <- read_html(url)
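
read_html() stops with an error if the page cannot be reached or the address is wrong. One way to handle that, shown here as a sketch only, is to wrap the call in tryCatch(). The helper name safe_read_html is hypothetical, and the deliberately invalid path exists only to demonstrate the failure branch without a network connection.

```r
library(rvest)

# Hypothetical helper: returns the parsed page, or NULL if reading fails
safe_read_html <- function(url) {
  tryCatch(
    read_html(url),
    error = function(e) {
      message("Could not read ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# A path that does not exist triggers the error branch
page_or_null <- safe_read_html("this-file-does-not-exist.html")
is.null(page_or_null)
```

Checking for NULL before continuing keeps the later steps from failing with a less helpful message.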

Step 4: Identify and Extract the Table

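Locate the table in the HTML structure. html_nodes("table") returns every <table> element on the page, so you can preview each one and pick the table you need. In rvest 1.0.0 and later, html_elements() is the preferred name for the same operation.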

R
# Extract all tables from the page
tables <- page %>% html_nodes("table")
# Check the number of tables and preview their content
length(tables)

# Inspect the first few rows of each table to find the right one
for (i in seq_along(tables)) {
  print(paste("Table", i))
  print(head(html_table(tables[[i]], fill = TRUE)))
}

# Select the GDP table. It is the third table in the output below,
# but the index can change whenever the page is edited, so confirm it first.
table <- tables[[3]] %>% 
  html_table(fill = TRUE)

Output:

[1] "Table 1"
# A tibble: 2 × 1
X1
<chr>
1 ""
2 "Largest economies in the world by GDP (nominal) in 2024according to Internat…
[1] "Table 2"
# A tibble: 1 × 3
X1 X2 X3
<chr> <chr> <chr>
1 .mw-parser-output .legend{page-break-inside:avoid;break-inside:av… $750… $50–…
[1] "Table 3"
# A tibble: 6 × 7
`Country/Territory` `IMF[1][13]` `IMF[1][13]` `World Bank[14]`
<chr> <chr> <chr> <chr>
1 Country/Territory Forecast Year Estimate
2 World 109,529,216 2024 105,435,540
3 United States 28,781,083 2024 27,360,935
4 China 18,532,633 [n 1]2024 17,794,782
5 Germany 4,591,100 2024 4,456,081
6 Japan 4,110,452 2024 4,212,945
# ℹ 3 more variables: `World Bank[14]` <chr>, `United Nations[15]` <chr>,
# `United Nations[15]` <chr>
[1] "Table 4"
# A tibble: 6 × 2
.mw-parser-output .navbar{display:inline;font-size:88…¹ .mw-parser-output .n…²
<chr> <chr>
1 Trade "Account balance\n% o…
2 Investment "FDI received\npast\n…
3 Funds "Forex reserves\nGold…
4 Budget and debt "Government budget\nP…
5 Income and taxes "Tax rates\nInheritan…
6 Bank rates "Central bank interes…
# ℹ abbreviated names:
# ¹​`.mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .
# ²​`.mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .
[1] "Table 5"
# A tibble: 6 × 2
`vteLists of countries by GDP rankings` vteLists of countries by GDP ranking…¹
<chr> <chr>
1 Nominal "Per capita\nPast and projected\nper …
2 Purchasing power parity(PPP) "Per capita\nPast\nper capita\nPast a…
3 Growth rate "African countries\nAsian states\nEur…
4 Gross national income (GNI) "PPP per capita\nNominal per capita\n…
5 Countries by region "Africa\nPPP\nnominal\nCommonwealth o…
6 Subnational divisions "Albania\nArgentina\nAustralia\nAustr…
# ℹ abbreviated name: ¹​`vteLists of countries by GDP rankings`
[1] "Table 6"
# A tibble: 6 × 6
vteEconomic classification of…¹ vteEconomic classifi…² `` `` `` ``
<chr> <chr> <chr> <chr> <chr> <chr>
1 "Developed country\nDeveloping… "Developed country\nD… <NA> <NA> <NA> <NA>
2 "Three/Four-World Model" "First World\nSecond … <NA> <NA> <NA> <NA>
3 "Gross domestic product (GDP)" "Nominal\nBy country\… Nomi… "By … Purc… "By …
4 "Nominal" "By country\npast and… <NA> <NA> <NA> <NA>
5 "Purchasing power parity (PPP… "By country\nfuture e… <NA> <NA> <NA> <NA>
6 "Gross national income (GNI)" "Nominal per capita\n… <NA> <NA> <NA> <NA>
# ℹ abbreviated names: ¹​`vteEconomic classification of countries`,
# ²​`vteEconomic classification of countries`
[1] "Table 7"
# A tibble: 2 × 2
X1 X2
<chr> <chr>
1 Nominal "By country\npast and projected\nper capita\np…
2 Purchasing power parity (PPP) "By country\nfuture estimates\nper capita\nper…
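
A hard-coded index such as 3 breaks as soon as the page gains or loses a table. A possible alternative, sketched below on a small inline page rather than the live Wikipedia article, is to pick the first table whose column names contain an expected header. The inline HTML and the names has_country and gdp_table are assumptions made for illustration.

```r
library(rvest)

# Inline stand-in for a page with one layout table and one data table
page <- read_html('<html><body>
  <table><tr><td>navigation</td></tr></table>
  <table>
    <tr><th>Country/Territory</th><th>Forecast</th></tr>
    <tr><td>Germany</td><td>4,591,100</td></tr>
  </table>
</body></html>')

# Convert every table, then test each one for the expected column name
tables <- html_table(html_nodes(page, "table"))
has_country <- vapply(
  tables,
  function(t) "Country/Territory" %in% names(t),
  logical(1)
)

# Keep the first match
gdp_table <- tables[[which(has_country)[1]]]
print(gdp_table)
```

If no table matches, which(has_country)[1] is NA and the subsetting fails, so production code would want an explicit check before this last step.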

Step 5: Inspect the Structure of the Table

Before proceeding, inspect the structure of the extracted table to understand its layout and content.

R
# Check the structure of the table to understand its content
print(head(table))

Output:

# A tibble: 6 × 7
Rank `Country/Territory` `GDP (US$ million)` `IMF Year` `World Bank Estimate`
<chr> <chr> <dbl> <chr> <chr>
1 Coun… Forecast NA Estimate Year
2 World 109,529,216 2024 105,435,5… 2023
3 Unit… 28,781,083 2024 27,360,935 2023
4 China 18,532,633 NA 17,794,782 [n 3]2023
5 Germ… 4,591,100 2024 4,456,081 2023
6 Japan 4,110,452 2024 4,212,945 2023
# ℹ 2 more variables: `` <chr>, `` <chr>

Step 6: Data Cleaning

Remove unwanted rows and columns, such as empty ones, to prepare the table for analysis, then rename the columns to make them more descriptive. Inspect the current column names first so that each new name lines up with the right column.

R
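# Remove empty rows and columns
table <- table[!apply(table, 1, function(row) all(is.na(row))), ]
table <- table[, colSums(!is.na(table)) > 0]

# Inspect column names
print(names(table))

# Rename all seven columns, in the order printed by names(table):
# the country, then a (value, year) pair for the IMF, World Bank, and UN
names(table) <- c("Country/Territory", "GDP (US$ million)", "IMF Year",
                  "World Bank Estimate", "World Bank Year",
                  "UN Estimate", "UN Year")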

Output:

[1] "Country/Territory"  "IMF[1][13]"         "IMF[1][13]"        
[4] "World Bank[14]" "World Bank[14]" "United Nations[15]"
[7] "United Nations[15]"
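
The names printed above repeat because the page uses a two-level header, and the second level (Forecast, Year, Estimate) ends up as the first data row. The sketch below shows one possible way to merge the two levels and drop that row. It runs on a small hand-built data frame that mimics the layout, and the names raw, top, sub, and cleaned are illustrative.

```r
# Hand-built stand-in for the scraped table: duplicated top-level names,
# with the second header level stored in the first row
raw <- data.frame(
  a = c("Country/Territory", "World", "United States"),
  b = c("Forecast", "109,529,216", "28,781,083"),
  c = c("Year", "2024", "2024"),
  stringsAsFactors = FALSE
)
names(raw) <- c("Country/Territory", "IMF[1][13]", "IMF[1][13]")

# Top level: column names with citation markers such as [1][13] removed
top <- gsub("\\[[^]]*\\]", "", names(raw))

# Second level: the first cell of every column
sub <- vapply(raw, function(col) col[1], character(1), USE.NAMES = FALSE)

# Combine both levels unless they are identical
new_names <- ifelse(top == sub, top, paste(top, sub))

# Drop the sub-header row and apply the merged names
cleaned <- raw[-1, ]
names(cleaned) <- new_names
print(cleaned)
```

Here the merged names come out as "Country/Territory", "IMF Forecast", and "IMF Year", which avoids typing the new names by hand.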

Step 7: Convert Data

Convert the GDP values to numeric data by removing the thousands separators (commas), which makes the data easier to analyze. Values that cannot be parsed as numbers become NA.

R
# Convert GDP column to numeric
if ("GDP (US$ million)" %in% names(table)) {
  table$`GDP (US$ million)` <- as.numeric(gsub(",", "", table$`GDP (US$ million)`))
}
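
Some cells on this page carry footnote markers, such as "[n 1]2024" in the Step 4 output, and as.numeric() turns those into NA. A small helper can strip the markers before converting. This is a sketch: the function name clean_number is hypothetical, and it assumes footnotes always appear in square brackets.

```r
# Hypothetical helper: strip bracketed footnotes and commas, then convert
clean_number <- function(x) {
  x <- gsub("\\[[^]]*\\]", "", x)      # remove markers like [n 1]
  x <- gsub(",", "", x, fixed = TRUE)  # remove thousands separators
  # Anything still non-numeric (e.g. header text) becomes NA
  suppressWarnings(as.numeric(trimws(x)))
}

values <- clean_number(c("18,532,633", "[n 1]2024", "Forecast"))
print(values)
```

The first two inputs convert to 18532633 and 2024, while the text "Forecast" becomes NA.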

Step 8: Check the Cleaned Table

Finally, inspect the cleaned table to ensure everything is correctly formatted.

R
# Check the cleaned table
print(head(table))

Output:

# A tibble: 6 × 7
Rank `Country/Territory` `GDP (US$ million)` `IMF Year` `World Bank Estimate` ``
<chr> <chr> <dbl> <chr> <chr> <chr>
1 Coun… Forecast NA Estimate Year Esti…
2 World 109,529,216 2024 105,435,5… 2023 100,…
3 Unit… 28,781,083 2024 27,360,935 2023 25,7…
4 China 18,532,633 NA 17,794,782 [n 3]2023 17,9…
5 Germ… 4,591,100 2024 4,456,081 2023 4,07…
6 Japan 4,110,452 2024 4,212,945 2023 4,23…
# ℹ 1 more variable: `` <chr>

Conclusion

Using rvest to scrape an HTML table is a straightforward process that involves loading the web page, identifying the table, and extracting it into a data frame. With just a few lines of code, you can gather data from the web and analyze it in R. Whether you're working with small tables or large datasets, rvest makes web scraping accessible and efficient.

