Scrape an HTML Table Using rvest in R
Last Updated: 29 Aug, 2024
Web scraping is a technique used to extract data from websites. In R, the rvest package is a popular tool for web scraping. It is easy to use and works well with pages whose content is present in the HTML source. This article walks you through the process of scraping an HTML table using rvest.
What is rvest?
rvest is an R package that simplifies the process of web scraping. It allows us to easily extract data from web pages by converting HTML content into R data frames, which are easy to manipulate and analyze. rvest works seamlessly with other R packages, making it an essential tool for data collection from web pages.
What are HTML Tables?
HTML tables are used to display data in rows and columns on web pages. Each table has a <table> tag, and within it, rows are defined with <tr> tags. Each row contains cells, which are defined with <td> tags (for regular cells) or <th> tags (for header cells). Understanding the structure of HTML tables is crucial for correctly extracting data using web scraping techniques.
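To make the tag structure concrete, here is a minimal sketch that parses a small hand-written table. The table contents are made up for illustration, and the code assumes rvest 1.0.0 or later, where html_element() is available and html_table() returns a tibble.

```r
library(rvest)

# A tiny hand-written page with one table (illustrative content only)
demo_page <- read_html('
  <table>
    <tr><th>Country</th><th>Capital</th></tr>
    <tr><td>France</td><td>Paris</td></tr>
    <tr><td>Japan</td><td>Tokyo</td></tr>
  </table>
')

# html_element() returns the first matching <table>;
# html_table() uses the <th> row as column names and each <tr> as a row
demo <- demo_page %>%
  html_element("table") %>%
  html_table()

print(demo)
```

The <th> cells become the column names and each remaining <tr> becomes one row of the result.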
The steps below show how to scrape an HTML table with rvest in the R programming language.
Step 1: Install and Load the Required Packages
First, ensure that you have the necessary packages installed and loaded.
R
# Install rvest and xml2 if you haven't already
install.packages("rvest")
install.packages("xml2")
# Load the packages
library(rvest)
library(xml2)
Step 2: Load the Web Page
Identify the URL of the web page from which you want to scrape the table.
R
# Specify the URL of the Wikipedia page
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
Step 3: Read the Web Page Content
Use the read_html() function to load the web page into R.
R
# Read the HTML content of the page
page <- read_html(url)
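Network requests can fail, so it may help to wrap read_html() in error handling. The sketch below is one possible approach, not part of rvest itself: safe_read_html() is a hypothetical helper name, and the example uses a missing local file so that it runs without a network connection.

```r
library(rvest)

# Hypothetical helper (not an rvest function): returns NULL instead of
# stopping when the page cannot be fetched or parsed
safe_read_html <- function(x) {
  tryCatch(
    read_html(x),
    error = function(e) {
      message("Could not read page: ", conditionMessage(e))
      NULL
    }
  )
}

# A path that does not exist triggers the error branch
result <- safe_read_html("no-such-file.html")
is.null(result)

# With the URL from Step 2, the call would be: page <- safe_read_html(url)
```

Checking for NULL before continuing lets a script stop cleanly or retry instead of failing midway.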
Step 4: Identify and Extract the Table
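Locate the table in the HTML structure. Use html_nodes() to select every <table> element on the page, then inspect each one to find the table you need. In rvest 1.0.0 and later, html_elements() is the preferred name for the same operation.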
R
# Extract all tables from the page
tables <- page %>% html_nodes("table")
# Check the number of tables and preview their content
length(tables)
# Inspect the first few rows of each table to find the right one
for (i in seq_along(tables)) {
  print(paste("Table", i))
  print(head(html_table(tables[[i]], fill = TRUE)))
}
# Select the correct table (assumed to be the third table)
table <- tables[[3]] %>%
  html_table(fill = TRUE)
Output:
[1] "Table 1"
# A tibble: 2 × 1
X1
<chr>
1 ""
2 "Largest economies in the world by GDP (nominal) in 2024according to Internat…
[1] "Table 2"
# A tibble: 1 × 3
X1 X2 X3
<chr> <chr> <chr>
1 .mw-parser-output .legend{page-break-inside:avoid;break-inside:av… $750… $50–…
[1] "Table 3"
# A tibble: 6 × 7
`Country/Territory` `IMF[1][13]` `IMF[1][13]` `World Bank[14]`
<chr> <chr> <chr> <chr>
1 Country/Territory Forecast Year Estimate
2 World 109,529,216 2024 105,435,540
3 United States 28,781,083 2024 27,360,935
4 China 18,532,633 [n 1]2024 17,794,782
5 Germany 4,591,100 2024 4,456,081
6 Japan 4,110,452 2024 4,212,945
# ℹ 3 more variables: `World Bank[14]` <chr>, `United Nations[15]` <chr>,
# `United Nations[15]` <chr>
[1] "Table 4"
# A tibble: 6 × 2
.mw-parser-output .navbar{display:inline;font-size:88…¹ .mw-parser-output .n…²
<chr> <chr>
1 Trade "Account balance\n% o…
2 Investment "FDI received\npast\n…
3 Funds "Forex reserves\nGold…
4 Budget and debt "Government budget\nP…
5 Income and taxes "Tax rates\nInheritan…
6 Bank rates "Central bank interes…
# ℹ abbreviated names:
# ¹​`.mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .
# ²​`.mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .
[1] "Table 5"
# A tibble: 6 × 2
`vteLists of countries by GDP rankings` vteLists of countries by GDP ranking…¹
<chr> <chr>
1 Nominal "Per capita\nPast and projected\nper …
2 Purchasing power parity(PPP) "Per capita\nPast\nper capita\nPast a…
3 Growth rate "African countries\nAsian states\nEur…
4 Gross national income (GNI) "PPP per capita\nNominal per capita\n…
5 Countries by region "Africa\nPPP\nnominal\nCommonwealth o…
6 Subnational divisions "Albania\nArgentina\nAustralia\nAustr…
# ℹ abbreviated name: ¹​`vteLists of countries by GDP rankings`
[1] "Table 6"
# A tibble: 6 × 6
vteEconomic classification of…¹ vteEconomic classifi…² `` `` `` ``
<chr> <chr> <chr> <chr> <chr> <chr>
1 "Developed country\nDeveloping… "Developed country\nD… <NA> <NA> <NA> <NA>
2 "Three/Four-World Model" "First World\nSecond … <NA> <NA> <NA> <NA>
3 "Gross domestic product (GDP)" "Nominal\nBy country\… Nomi… "By … Purc… "By …
4 "Nominal" "By country\npast and… <NA> <NA> <NA> <NA>
5 "Purchasing power parity (PPP… "By country\nfuture e… <NA> <NA> <NA> <NA>
6 "Gross national income (GNI)" "Nominal per capita\n… <NA> <NA> <NA> <NA>
# ℹ abbreviated names: ¹​`vteEconomic classification of countries`,
# ²​`vteEconomic classification of countries`
[1] "Table 7"
# A tibble: 2 × 2
X1 X2
<chr> <chr>
1 Nominal "By country\npast and projected\nper capita\np…
2 Purchasing power parity (PPP) "By country\nfuture estimates\nper capita\nper…
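Selecting a table by position breaks if the page layout changes. As a hedged alternative, the sketch below picks the first table whose header contains a known column name. find_table_with_column() is a hypothetical helper, and the inline HTML is a made-up stand-in for the real page; it assumes rvest 1.0.0 or later.

```r
library(rvest)

# Hypothetical helper: return the first table whose column names include
# `column`, or NULL if no table matches
find_table_with_column <- function(page, column) {
  tables <- html_table(html_elements(page, "table"))
  has_column <- vapply(tables, function(tbl) column %in% names(tbl), logical(1))
  if (!any(has_column)) {
    return(NULL)
  }
  tables[[which(has_column)[1]]]
}

# Stand-in page: a layout table followed by a data table
demo_page <- read_html('
  <table><tr><td>Navigation</td><td>Links</td></tr></table>
  <table>
    <tr><th>Country/Territory</th><th>Forecast</th></tr>
    <tr><td>World</td><td>109,529,216</td></tr>
  </table>
')

gdp <- find_table_with_column(demo_page, "Country/Territory")
print(gdp)
```

On the real page the same call would be find_table_with_column(page, "Country/Territory"), which avoids hard-coding the index 3.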
Step 5: Inspect the Structure of the Table
Before proceeding, inspect the structure of the extracted table to understand its layout and content.
R
# Check the structure of the table to understand its content
print(head(table))
Output:
# A tibble: 6 × 7
`Country/Territory` `IMF[1][13]` `IMF[1][13]` `World Bank[14]`
<chr> <chr> <chr> <chr>
1 Country/Territory Forecast Year Estimate
2 World 109,529,216 2024 105,435,540
3 United States 28,781,083 2024 27,360,935
4 China 18,532,633 [n 1]2024 17,794,782
5 Germany 4,591,100 2024 4,456,081
6 Japan 4,110,452 2024 4,212,945
# ℹ 3 more variables: `World Bank[14]` <chr>, `United Nations[15]` <chr>,
# `United Nations[15]` <chr>
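Besides head(), a few base R calls give a quick picture of a table's shape and column types. The sketch below uses a small toy data frame with two values copied from the scraped table; the column names country and forecast are illustrative only.

```r
# Toy stand-in for the scraped table
toy <- data.frame(
  country  = c("World", "United States"),
  forecast = c("109,529,216", "28,781,083"),
  stringsAsFactors = FALSE
)

# Number of rows and columns
dim(toy)

# Class of each column; "character" means the numbers are still text
col_types <- vapply(toy, function(col) class(col)[1], character(1))
print(col_types)

# Compact overview of names, types, and first values
str(toy)
```

Columns reported as character need cleaning before any arithmetic can be done on them.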
Step 6: Data Cleaning
Remove any unwanted rows or columns, such as empty ones, to prepare the table for analysis. Then inspect the column names and rename them so that they are more descriptive.
R
# Remove empty rows and columns
table <- table[!apply(table, 1, function(row) all(is.na(row))), ]
table <- table[, colSums(!is.na(table)) > 0]
# Inspect column names
print(names(table))
# Rename columns
names(table) <- c("Rank", "Country/Territory", "GDP (US$ million)", "IMF Year",
                  "World Bank Estimate")
Output:
[1] "Country/Territory" "IMF[1][13]" "IMF[1][13]"
[4] "World Bank[14]" "World Bank[14]" "United Nations[15]"
[7] "United Nations[15]"
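The names printed above show that the table has seven columns, with each source repeated twice, and the first data row carries the sub-labels (Forecast, Year, Estimate). A replacement vector therefore needs seven entries, in the same order as the columns. One possible way to build distinct names for every column is to combine the two header levels, as in the sketch below. It uses a four-column toy copy of the scraped layout, and the variable names are illustrative.

```r
# Toy copy of the scraped layout: duplicated header names plus a
# sub-header stored in the first row (only four columns reproduced)
scraped <- data.frame(
  c("Country/Territory", "World", "United States"),
  c("Forecast", "109,529,216", "28,781,083"),
  c("Year", "2024", "2024"),
  c("Estimate", "105,435,540", "27,360,935"),
  stringsAsFactors = FALSE
)
names(scraped) <- c("Country/Territory", "IMF[1][13]", "IMF[1][13]", "World Bank[14]")

# Strip footnote markers such as "[1][13]" from the top-level names
top_labels <- gsub("\\[[^]]*\\]", "", names(scraped))

# The first row holds the second header level
sub_labels <- unlist(scraped[1, ], use.names = FALSE)

# Combine both levels, skipping repeats like "Country/Territory Country/Territory"
new_names <- ifelse(top_labels == sub_labels, top_labels,
                    paste(top_labels, sub_labels))

# Drop the sub-header row and apply the new names
clean <- scraped[-1, , drop = FALSE]
names(clean) <- new_names
rownames(clean) <- NULL

print(names(clean))
```

Deriving the names from the page itself keeps each label attached to the right column, which is easy to get wrong when typing a name vector by hand.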
Step 7: Convert Data
Convert the GDP values to numeric data by removing commas, making the data easier to analyze.
R
# Convert GDP column to numeric
if ("GDP (US$ million)" %in% names(table)) {
  table$`GDP (US$ million)` <- as.numeric(gsub(",", "", table$`GDP (US$ million)`))
}
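Some cells in the scraped table also carry footnote markers, for example "[n 1]2024", which as.numeric() turns into NA. The sketch below shows one way to strip both markers and commas before converting. to_number() is a hypothetical helper, not a function from rvest or base R.

```r
# Hypothetical helper: remove footnote markers and thousands separators,
# then convert to numeric; text that is not a number becomes NA
to_number <- function(x) {
  x <- gsub("\\[[^]]*\\]", "", x)      # drop markers like "[n 1]"
  x <- gsub(",", "", x, fixed = TRUE)  # drop thousands separators
  suppressWarnings(as.numeric(trimws(x)))
}

converted <- to_number(c("18,532,633", "[n 1]2024", "Forecast"))
print(converted)
```

The helper can be applied to any column of the table in the same way as the as.numeric() call above.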
Step 8: Check the Cleaned Table
Finally, inspect the cleaned table to ensure everything is correctly formatted.
R
# Check the cleaned table
print(head(table))
Output:
# A tibble: 6 × 7
Rank `Country/Territory` `GDP (US$ million)` `IMF Year` `World Bank Estimate` ``
<chr> <chr> <dbl> <chr> <chr> <chr>
1 Coun… Forecast NA Estimate Year Esti…
2 World 109,529,216 2024 105,435,5… 2023 100,…
3 Unit… 28,781,083 2024 27,360,935 2023 25,7…
4 China 18,532,633 NA 17,794,782 [n 3]2023 17,9…
5 Germ… 4,591,100 2024 4,456,081 2023 4,07…
6 Japan 4,110,452 2024 4,212,945 2023 4,23…
# ℹ 1 more variable: `` <chr>
Conclusion
Using rvest to scrape an HTML table is a straightforward process that involves loading the web page, identifying the table, and extracting it into a data frame. With just a few lines of code, you can gather data from the web and analyze it in R. Whether you're working with small tables or large datasets, rvest makes web scraping accessible and efficient.