
Extract Links Using rvest in R

Last Updated : 09 Apr, 2025

Web scraping is the automated retrieval of data from websites. One common web scraping task is hyperlink extraction, which has many uses, including data collection for research, website crawling, and tracking changes to a website over time. By extracting links, you can follow the structure of a website and accumulate relevant information.

What is rvest and Why Use It?

rvest is an R package that makes scraping data from the web straightforward. It builds on the xml2 and httr packages, which handle HTML parsing and HTTP requests respectively, and exposes a simple interface on top of them. With rvest, you can download web pages, select HTML elements using CSS selectors, and extract their contents with just a few lines of code.

Basics of HTML Structure

HTML (HyperText Markup Language) is the standard language for creating web pages. Understanding the basics of HTML is important for web scraping. An HTML document is structured as elements with attributes.

  • Elements: Represented by tags such as '<a>', '<div>', '<p>', etc. Elements can be nested inside other elements.
  • Attributes: Provide additional information about an element. For example, the 'href' attribute in an '<a>' tag holds the address the link points to.

An HTML link looks like this:

<a href="https://example.com">Example</a>

Here '<a>' is the anchor tag, 'href' is the attribute that contains the URL, and "Example" is the link text.
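To see how rvest maps onto this structure, here is a minimal, self-contained sketch that parses an HTML fragment from a string (no network access needed); the fragment itself is made up for illustration:

```r
library(rvest)

# Parse a small HTML fragment from a string instead of a live page
html <- minimal_html('
  <a href="https://example.com">Example</a>
  <a href="https://example.org">Another link</a>
')

# The href attribute gives each URL, in document order
html %>% html_nodes("a") %>% html_attr("href")

# html_text() gives the visible link text instead
html %>% html_nodes("a") %>% html_text()
```

The same two calls, html_attr() for attributes and html_text() for text, are all you need for link extraction on a real page.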

Using rvest to Extract Links

The rvest package makes downloading web pages and extracting information from them simple. Before using its features, you must install the package by running the following in your R console:

install.packages("rvest")

After installing, the package should be loaded into your R session:

library(rvest)

We will now walk through extracting links using rvest in R step by step.

Step 1: Reading the HTML Content of a Webpage

First, specify the URL of the web page you want to scrape and read its HTML content:

R
# Specify the URL of the webpage
url <- "https://www.geeksforgeeks.org"

# Read the HTML content of the webpage
webpage <- read_html(url)

Step 2: Selecting the Relevant Nodes that Contain the Links

Select the nodes that contain the links. In HTML, links are usually contained within '<a>' tags. (In rvest 1.0 and later, html_elements() is the preferred name for html_nodes(), which still works.)

R
# Select the relevant nodes that contain the links
link_nodes <- webpage %>% html_nodes("a")

Step 3: Extracting the URLs from the selected nodes

Finally, extract the URLs from the selected nodes using the 'html_attr' function:

R
# Extract all links from the webpage
links <- webpage %>%
  html_nodes("a") %>%
  html_attr("href")

# Print the extracted links
print(links)

Output:

[1] "https://www.geeksforgeeks.org/"
[2] "https://www.geeksforgeeks.org/trending/"
[3] "https://www.geeksforgeeks.org/learn-data-structures-and-algorithms-dsa-tutorial/"
[4] "https://www.geeksforgeeks.org/web-technology/"
[5] "https://www.geeksforgeeks.org/courses/category/programming-languages..."

In the above program, the URL used for demonstration is https://www.geeksforgeeks.org; you can replace it with any web page URL you wish to scrape.

  • First, load the rvest package so that its functions are available.
  • Next, specify the URL, replacing "https://www.geeksforgeeks.org" with the web page you wish to scrape.
  • Read the HTML content with read_html(url), which downloads and parses the HTML of the page.
  • To extract the links, select all <a> tags with html_nodes("a"), then pull out their href attributes with html_attr("href").
  • Finally, print the extracted links.
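The steps above can be combined into a single pipeline. As a small extension (my own addition, not part of the original steps), you can also pair each URL with its link text in a data frame, which is often more useful than a bare character vector:

```r
library(rvest)

url <- "https://www.geeksforgeeks.org"
webpage <- read_html(url)

link_nodes <- webpage %>% html_nodes("a")

# Pair each link's visible text with its URL
links_df <- data.frame(
  text = html_text(link_nodes, trim = TRUE),
  href = html_attr(link_nodes, "href"),
  stringsAsFactors = FALSE
)

head(links_df)
```

Because both columns come from the same link_nodes vector, the text and href in each row are guaranteed to belong to the same <a> tag.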

Common Pitfalls

  • Invalid URLs: Always make sure that the URL you are trying to access is valid and reachable, or it may show an error when it fails to return anything.
  • Missing Attributes: It might happen that some <a> tags lack href attributes leading to NA outputs.
  • Relative URLs: Some links can also be relative instead of absolute, therefore, they may require conversion into absolute URLs with base URL.
  • Robots.txt: Many websites publish a robots.txt file that specifies which parts of the site automated crawlers are allowed to access; check it before scraping.
  • Website Terms of Service: Finally before scraping any web data try checking on terms of service from their respective owners.
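Two of these pitfalls, missing href attributes and relative URLs, can be handled in code. A sketch using url_absolute() from the xml2 package (which rvest depends on):

```r
library(rvest)
library(xml2)   # for url_absolute()

base_url <- "https://www.geeksforgeeks.org"
webpage <- read_html(base_url)

links <- webpage %>% html_nodes("a") %>% html_attr("href")

# Drop <a> tags that had no href attribute (these come back as NA)
links <- links[!is.na(links)]

# Convert relative URLs (e.g. "/about/") to absolute ones against the base URL
links <- url_absolute(links, base_url)

print(links)
```

Absolute URLs are what you want if you plan to feed the extracted links back into read_html() to crawl further pages.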

Conclusion

Using the rvest package in R, you can easily extract hyperlinks from any site. Text, images, and tables are other kinds of data that can be collected with basic web scraping tasks like this one. Mastering these basics lets you build more complex web scraping workflows for automating the collection and analysis of information.

