Scrape an HTML Table Using rvest in R

Last Updated : 29 Aug, 2024

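Web scraping is a technique used to extract data from websites. In R, the rvest package is a popular tool for web scraping. It is easy to use and works well with pages whose content is present in the HTML source; pages that build their tables with JavaScript after loading need additional tooling. This article walks you through the process of scraping an HTML table using rvest.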

What is rvest?

rvest is an R package that simplifies the process of web scraping. It parses the HTML of a page, lets you select elements with CSS selectors or XPath expressions, and converts HTML tables into R data frames (tibbles), which are easy to manipulate and analyze. rvest builds on the xml2 package and works well alongside other R packages, which makes it a convenient tool for collecting data from web pages.

What are HTML Tables?

HTML tables are used to display data in rows and columns on web pages. Each table has a <table> tag, and within it, rows are defined with <tr> tags. Each row contains cells, which are defined with <td> tags (for regular cells) or <th> tags (for header cells). Understanding the structure of HTML tables is crucial for correctly extracting data using web scraping techniques.
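
To make the tag structure concrete, here is a small sketch that parses a hand-written table with rvest. The HTML snippet and the names used (fruit_page, fruit_table, the Fruit and Price columns) are made up for illustration; minimal_html(), html_element() and html_table() are rvest functions.

```r
library(rvest)

# A tiny hand-written page: one <table> with a header row (<th>)
# and two data rows (<td>)
fruit_page <- minimal_html('
  <table>
    <tr><th>Fruit</th><th>Price</th></tr>
    <tr><td>Apple</td><td>1.20</td></tr>
    <tr><td>Banana</td><td>0.50</td></tr>
  </table>
')

# html_element() returns the first matching node;
# html_table() converts that <table> node into a tibble
fruit_table <- html_table(html_element(fruit_page, "table"))
print(fruit_table)
```

The <th> cells become the column names and each <tr> with <td> cells becomes one row, which is the same mapping the rest of this article relies on.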

Now we will walk through, step by step, how to use rvest to scrape an HTML table in the R programming language.

Step 1: Install and Load the Required Packages

First, ensure that you have the necessary packages installed and loaded.

# Install rvest and xml2 if you haven't already
install.packages("rvest")
install.packages("xml2")

# Load the packages
library(rvest)
library(xml2)
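
Calling install.packages() on every run downloads the packages again even when they are already present. A common alternative, sketched here under the assumption that a CRAN mirror is configured, is to install a package only when requireNamespace() cannot find it.

```r
# Install each package only if it is missing, then load both
for (pkg in c("rvest", "xml2")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

library(rvest)
library(xml2)
```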

Step 2: Specify the URL of the Web Page

Identify the URL of the web page from which you want to scrape the table.

# Specify the URL of the Wikipedia page
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

Step 3: Read the Web Page Content

Use the read_html() function to load the web page into R.

page <- read_html(url)
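
read_html() stops with an error when the page cannot be fetched or the file does not exist. The sketch below wraps it in tryCatch() so a script can continue and report the problem. read_page_safely() is a hypothetical helper name, not part of rvest, and the demonstration uses a literal HTML string and a non-existent file name instead of a live URL so that it runs offline.

```r
library(rvest)

# Hypothetical helper: returns the parsed document, or NULL if reading fails
read_page_safely <- function(source) {
  tryCatch(
    read_html(source),
    error = function(e) {
      message("Could not read page: ", conditionMessage(e))
      NULL
    }
  )
}

# read_html() also accepts a literal HTML string, which is handy for testing
ok_page <- read_page_safely("<html><body><p>Hello</p></body></html>")

# A path that does not exist triggers the error branch and yields NULL
missing_page <- read_page_safely("no_such_file.html")
```

In a real script you would pass the page URL to the helper and check the result with is.null() before moving on to the extraction step.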

Step 4: Identify and Extract the Table

Locate the table in the HTML structure. You can use html_elements() to find every <table> on the page and inspect each one. (In rvest 1.0.0 and later, html_elements() supersedes the older html_nodes(), which still works.)

# Extract all tables from the page
tables <- page %>% html_elements("table")
# Check the number of tables
length(tables)

# Inspect the first few rows of each table to find the right one
for (i in seq_along(tables)) {
  print(paste("Table", i))
  print(head(html_table(tables[[i]])))
}
# Select the correct table (the third table when this was written;
# the index can change whenever the page is edited)
table <- tables[[3]] %>%
  html_table()

Output:

[1] "Table 1"
# A tibble: 2 × 1
  X1                                                                            
  <chr>                                                                         
1 ""                                                                            
2 "Largest economies in the world by GDP (nominal) in 2024according to Internat…
[1] "Table 2"
# A tibble: 1 × 3
  X1                                                                 X2    X3   
  <chr>                                                              <chr> <chr>
1 .mw-parser-output .legend{page-break-inside:avoid;break-inside:av… $750… $50–…
[1] "Table 3"
# A tibble: 6 × 7
  `Country/Territory` `IMF[1][13]` `IMF[1][13]` `World Bank[14]`
  <chr>               <chr>        <chr>        <chr>           
1 Country/Territory   Forecast     Year         Estimate        
2 World               109,529,216  2024         105,435,540     
3 United States       28,781,083   2024         27,360,935      
4 China               18,532,633   [n 1]2024    17,794,782      
5 Germany             4,591,100    2024         4,456,081       
6 Japan               4,110,452    2024         4,212,945       
# ℹ 3 more variables: `World Bank[14]` <chr>, `United Nations[15]` <chr>,
#   `United Nations[15]` <chr>
[1] "Table 4"
# A tibble: 6 × 2
  .mw-parser-output .navbar{display:inline;font-size:88…¹ .mw-parser-output .n…²
  <chr>                                                   <chr>                 
1 Trade                                                   "Account balance\n% o…
2 Investment                                              "FDI received\npast\n…
3 Funds                                                   "Forex reserves\nGold…
4 Budget and debt                                         "Government budget\nP…
5 Income and taxes                                        "Tax rates\nInheritan…
6 Bank rates                                              "Central bank interes…
# ℹ abbreviated names:
#   ¹`.mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .
#   ²`.mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .
[1] "Table 5"
# A tibble: 6 × 2
  `vteLists of countries by GDP rankings` vteLists of countries by GDP ranking…¹
  <chr>                                   <chr>                                 
1 Nominal                                 "Per capita\nPast and projected\nper …
2 Purchasing power parity(PPP)            "Per capita\nPast\nper capita\nPast a…
3 Growth rate                             "African countries\nAsian states\nEur…
4 Gross national income (GNI)             "PPP per capita\nNominal per capita\n…
5 Countries by region                     "Africa\nPPP\nnominal\nCommonwealth o…
6 Subnational divisions                   "Albania\nArgentina\nAustralia\nAustr…
# ℹ abbreviated name: ¹`vteLists of countries by GDP rankings`
[1] "Table 6"
# A tibble: 6 × 6
  vteEconomic classification of…¹ vteEconomic classifi…² ``    ``    ``    ``   
  <chr>                           <chr>                  <chr> <chr> <chr> <chr>
1 "Developed country\nDeveloping… "Developed country\nD… <NA>   <NA> <NA>   <NA>
2 "Three/Four-World Model"        "First World\nSecond … <NA>   <NA> <NA>   <NA>
3 "Gross domestic product (GDP)"  "Nominal\nBy country\… Nomi… "By … Purc… "By …
4 "Nominal"                       "By country\npast and… <NA>   <NA> <NA>   <NA>
5 "Purchasing  power parity (PPP… "By country\nfuture e… <NA>   <NA> <NA>   <NA>
6 "Gross national income (GNI)"   "Nominal per capita\n… <NA>   <NA> <NA>   <NA>
# ℹ abbreviated names: ¹`vteEconomic classification of countries`,
#   ²`vteEconomic classification of countries`
[1] "Table 7"
# A tibble: 2 × 2
  X1                             X2                                             
  <chr>                          <chr>                                          
1 Nominal                        "By country\npast and projected\nper capita\np…
2 Purchasing  power parity (PPP) "By country\nfuture estimates\nper capita\nper…
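
Picking a table by position is fragile, because the index shifts whenever the page gains or loses a table. One alternative is to pick the table by a column name you expect it to contain. The sketch below is an illustration only: find_table_with_column() is a hypothetical helper, and the demo page and its values are made up so that the example runs offline.

```r
library(rvest)

# Hypothetical helper: index of the first parsed table that has the given column
find_table_with_column <- function(parsed_tables, column_name) {
  has_column <- vapply(
    parsed_tables,
    function(tbl) column_name %in% names(tbl),
    logical(1)
  )
  which(has_column)[1]
}

# Made-up page with two tables; only the second has the column we want
demo_page <- minimal_html('
  <table><tr><th>Legend</th></tr><tr><td>note</td></tr></table>
  <table>
    <tr><th>Country/Territory</th><th>GDP</th></tr>
    <tr><td>Exampleland</td><td>1,000</td></tr>
  </table>
')

# html_table() on a set of nodes returns a list with one tibble per table
parsed_tables <- html_table(html_elements(demo_page, "table"))
gdp_index <- find_table_with_column(parsed_tables, "Country/Territory")
gdp_table <- parsed_tables[[gdp_index]]
print(gdp_table)
```

If no table has the column, which(has_column)[1] is NA, so a real script should check for that before indexing.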

Step 5: Inspect the Structure of the Table

Before proceeding, inspect the structure of the extracted table to understand its layout and content.

# Check the structure of the table to understand its content
print(head(table))

Output:

# A tibble: 6 × 7
  `Country/Territory` `IMF[1][13]` `IMF[1][13]` `World Bank[14]`
  <chr>               <chr>        <chr>        <chr>           
1 Country/Territory   Forecast     Year         Estimate        
2 World               109,529,216  2024         105,435,540     
3 United States       28,781,083   2024         27,360,935      
4 China               18,532,633   [n 1]2024    17,794,782      
5 Germany             4,591,100    2024         4,456,081       
6 Japan               4,110,452    2024         4,212,945       
# ℹ 3 more variables: `World Bank[14]` <chr>, `United Nations[15]` <chr>,
#   `United Nations[15]` <chr>
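
Besides head(), a few base R calls give a quick summary of a table's shape and column types. The sketch below uses a two-row stand-in data frame named scraped (a hypothetical name) with values copied from the printout above, so it runs without a network connection.

```r
# Stand-in for the scraped table: every column arrives as text
scraped <- data.frame(
  country  = c("Country/Territory", "World"),
  forecast = c("Forecast", "109,529,216"),
  stringsAsFactors = FALSE
)

dim(scraped)                          # number of rows and columns
names(scraped)                        # column names
vapply(scraped, class, character(1))  # class of each column
str(scraped)                          # compact overview of the whole table
```

Seeing character in every column is the cue that the numeric values still need to be converted.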

Step 6: Data Cleaning

Remove any unwanted rows or columns, such as empty ones, and then rename the columns to make them more descriptive. The scraped table has seven columns: the country name, followed by a value and a year for each of the three sources (IMF, World Bank and United Nations), so the new names must cover all seven. First, inspect the column names, then rename them accordingly.

# Remove empty rows and columns
table <- table[!apply(table, 1, function(row) all(is.na(row))), ]
table <- table[, colSums(!is.na(table)) > 0]
# Inspect column names
print(names(table))

# Rename all seven columns; the IMF forecast is the GDP figure used in Step 7
names(table) <- c("Country/Territory", "GDP (US$ million)", "IMF Year",
                  "World Bank Estimate", "World Bank Year",
                  "UN Estimate", "UN Year")

Output:

[1] "Country/Territory"  "IMF[1][13]"         "IMF[1][13]"        
[4] "World Bank[14]"     "World Bank[14]"     "United Nations[15]"
[7] "United Nations[15]"

Step 7: Convert Data

Convert the GDP values to numeric data by removing commas, making the data easier to analyze. The first row of the table repeats the sub-header ("Forecast"), so it becomes NA and R warns that NAs were introduced by coercion.

# Convert GDP column to numeric
if ("GDP (US$ million)" %in% names(table)) {
  table$`GDP (US$ million)` <- as.numeric(gsub(",", "", table$`GDP (US$ million)`))
}

Step 8: Check the Cleaned Table

Finally, inspect the cleaned table to ensure everything is correctly formatted. Printing only the first four columns keeps the output narrow enough to read.

# Check the cleaned table (first four columns)
print(head(table[, 1:4]))

Output:

# A tibble: 6 × 4
  `Country/Territory` `GDP (US$ million)` `IMF Year` `World Bank Estimate`
  <chr>                             <dbl> <chr>      <chr>
1 Country/Territory                    NA Year       Estimate
2 World                         109529216 2024       105,435,540
3 United States                  28781083 2024       27,360,935
4 China                          18532633 [n 1]2024  17,794,782
5 Germany                         4591100 2024       4,456,081
6 Japan                           4110452 2024       4,212,945
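
The printout still contains a repeated header row and footnote markers such as "[n 1]" in the year values. The sketch below shows one way to handle both with base R. The helper names parse_number_text() and strip_footnotes() are hypothetical, and sample_rows is a small stand-in built from values shown earlier in this article, not the full scraped table.

```r
# Hypothetical helper: drop thousands separators, then convert to numeric.
# Text that is not a number becomes NA.
parse_number_text <- function(x) {
  suppressWarnings(as.numeric(gsub(",", "", x, fixed = TRUE)))
}

# Hypothetical helper: remove bracketed markers such as "[n 1]"
strip_footnotes <- function(x) {
  trimws(gsub("\\[[^]]*\\]", "", x))
}

sample_rows <- data.frame(
  country  = c("Country/Territory", "World", "China"),
  forecast = c("Forecast", "109,529,216", "18,532,633"),
  year     = c("Year", "2024", "[n 1]2024"),
  stringsAsFactors = FALSE
)

# Drop the repeated header row, identified by its label in the first column
cleaned <- sample_rows[sample_rows$country != "Country/Territory", ]
cleaned$forecast <- parse_number_text(cleaned$forecast)
cleaned$year <- as.integer(strip_footnotes(cleaned$year))
print(cleaned)
```

Filtering on the header label rather than on a fixed row number keeps the step working even if the row order of the source table changes.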

Conclusion

Using rvest to scrape an HTML table is a straightforward process that involves loading the web page, identifying the table, and extracting it into a data frame. With just a few lines of code, you can gather data from the web and analyze it in R. Whether you're working with small tables or large datasets, rvest makes web scraping accessible and efficient.

pmishra01
