How to Extract Node with Same Label in R

Last Updated : 08 Aug, 2024

XML (eXtensible Markup Language) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. An XML document is structured as nested elements called nodes, which can contain text, attributes, and other nested nodes. In this article, we will discuss how to extract nodes with the same label in the R programming language.

What is a Node?

A node is an element within the document structure, which may contain tags, attributes, and text. For example, in HTML, <div>, <p>, and <a> are nodes. Now we will create a simple XML structure.

<books>
  <book>
    <title>C Programming</title>
    <author>Author A</author>
  </book>
  <book>
    <title>Java</title>
    <author>Author B</author>
  </book>
</books>

In this example, <books> is the root node and it can contain multiple <book> nodes. Each <book> node contains a <title> node and an <author> node.
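The hierarchy described above can be checked directly in R. The following is a minimal sketch, assuming the xml2 package is installed; it parses the sample document from a string and lists the node names at each level. The variable names are illustrative only.

```r
library(xml2)

# Parse the sample document from a string (read_xml accepts literal XML)
doc <- read_xml("<books>
  <book><title>C Programming</title><author>Author A</author></book>
  <book><title>Java</title><author>Author B</author></book>
</books>")

# Name of the root node and of its direct child elements
root_name <- xml_name(doc)
child_names <- xml_name(xml_children(doc))

# Names of the nodes nested inside the first <book>
first_book <- xml_children(doc)[[1]]
inner_names <- xml_name(xml_children(first_book))

print(root_name)
print(child_names)
print(inner_names)
```

xml_children returns only element nodes, so the whitespace between tags does not appear in the result.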

Node Extraction Using the xml2 Package

The xml2 package in R is a powerful tool for parsing XML documents and extracting data from them.

Step 1: Install and load the xml2 package

We can install and load it using the commands below.

install.packages("xml2")

library(xml2)

To work with XML nodes in R, we first need to load the XML file and parse it into an R object. Here is how you can do that:

# Load and parse the XML file
xml_file <- read_xml("path_to_your_file.xml")
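To try the loading step without a file of your own, one option is to write a small document to a temporary file first. This is only a sketch: the temporary file stands in for a real path, and its one-book content is an assumption made for illustration.

```r
library(xml2)

# Write a small document to a temporary file (stand-in for a real path)
tmp <- tempfile(fileext = ".xml")
writeLines("<books><book><title>Java</title></book></books>", tmp)

# Parse the file and inspect the resulting object
xml_file <- read_xml(tmp)
print(class(xml_file))
print(xml_name(xml_file))
```

The parsed object is an xml_document, which the other xml2 functions in this article accept as their first argument.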

To extract nodes with a specific label, use the xml_find_all function from the xml2 package, which returns every node that matches a given XPath expression. The example below assumes the file contains <item> elements that carry a label attribute and selects those whose label is "A".

# Load the xml2 package
library(xml2)

# Path to your XML file
file_path <- "C:\\Users\\GFG19565\\Downloads\\sa1.xml"

# Read and parse the XML file
xml_doc <- read_xml(file_path)

# Print the XML content (for debugging purposes)
print(xml_doc)

# Extract all <item> nodes whose label attribute equals "A"
nodes <- xml_find_all(xml_doc, "//item[@label='A']")

# Display the content of these nodes
node_texts <- xml_text(nodes)
print(node_texts)

Output:

<book><title>Python</title><author>Author C</author></book>
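Because the result depends on the contents of the XML file, a self-contained sketch can make the attribute filter easier to follow. It assumes a hypothetical inline document with three <item> elements; the element text and the variable names are invented for illustration and are not taken from sa1.xml.

```r
library(xml2)

# Hypothetical document: <item> elements tagged with a label attribute
doc <- read_xml('<items>
  <item label="A">first</item>
  <item label="B">second</item>
  <item label="A">third</item>
</items>')

# Select every <item> whose label attribute equals "A"
nodes <- xml_find_all(doc, "//item[@label='A']")

# Text content and attribute value of each matching node
node_texts <- xml_text(nodes)
node_labels <- xml_attr(nodes, "label")

print(node_texts)
print(node_labels)
```

Note that xml_text returns plain text for each match, while printing the node set itself would show the markup.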

Handling Multiple Nodes

When multiple nodes share the same label, we may want to extract specific information from each one. We can loop through the extracted nodes and access their child nodes or attributes.

Step 1: Create the books.xml File

We can create a simple XML file that contains the details of several books and save it as books.xml.

<books>
  <book>
    <title>Learning Python</title>
    <author>Mark Lutz</author>
    <year>2013</year>
    <publisher>O'Reilly Media</publisher>
  </book>
  <book>
    <title>Effective Java</title>
    <author>Joshua Bloch</author>
    <year>2018</year>
    <publisher>Addison-Wesley</publisher>
  </book>
  <book>
    <title>Clean Code</title>
    <author>Robert C. Martin</author>
    <year>2008</year>
    <publisher>Prentice Hall</publisher>
  </book>
  <book>
    <title>The Pragmatic Programmer</title>
    <author>Andrew Hunt, David Thomas</author>
    <year>1999</year>
    <publisher>Addison-Wesley</publisher>
  </book>

</books>

Step 2: R Code Implementation

We can use the code below to extract nodes with the same label in R.

library(xml2)

# Load and parse the XML file
xml_file <- read_xml("C:/Users/Syam/OneDrive/Desktop/PHP Projects/books.xml")

# Extract all <book> nodes
book_nodes <- xml_find_all(xml_file, "//book")

# Print the extracted nodes
print(book_nodes)

# Loop through each <book> node and extract information
for (book in book_nodes) {
  title <- xml_text(xml_find_first(book, "title"))
  author <- xml_text(xml_find_first(book, "author"))
  year <- xml_text(xml_find_first(book, "year"))
  publisher <- xml_text(xml_find_first(book, "publisher"))
  
  cat("Title:", title, "\n")
  cat("Author:", author, "\n")
  cat("Year:", year, "\n")
  cat("Publisher:", publisher, "\n\n")
}

Output:

{xml_nodeset (4)}
[1] <book>\n  <title>Learning Python</title>\n  <author>Mark Lutz</author>\n  <year ...
[2] <book>\n  <title>Effective Java</title>\n  <author>Joshua Bloch</author>\n  <ye ...
[3] <book>\n  <title>Clean Code</title>\n  <author>Robert C. Martin</author>\n  <ye ...
[4] <book>\n  <title>The Pragmatic Programmer</title>\n  <author>Andrew Hunt, David ...

Title: Learning Python 
Author: Mark Lutz 
Year: 2013 
Publisher: O'Reilly Media 

Title: Effective Java 
Author: Joshua Bloch 
Year: 2018 
Publisher: Addison-Wesley 

Title: Clean Code 
Author: Robert C. Martin 
Year: 2008 
Publisher: Prentice Hall 

Title: The Pragmatic Programmer 
Author: Andrew Hunt, David Thomas 
Year: 1999 
Publisher: Addison-Wesley

In this example, xml_find_first finds the first occurrence of the named child node within each <book> node, and xml_text extracts the text content of that node.
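As an alternative to the loop, xml_find_first can also be applied to the whole node set at once, which makes it easy to collect the fields into a data frame. The sketch below uses a shortened two-book document in which the second book deliberately has no <year>, to show how a missing child is handled; child_text is a hypothetical helper name, not part of xml2.

```r
library(xml2)

# Shortened sample; the second <book> intentionally lacks a <year>
doc <- read_xml("<books>
  <book><title>Learning Python</title><author>Mark Lutz</author><year>2013</year></book>
  <book><title>Effective Java</title><author>Joshua Bloch</author></book>
</books>")

book_nodes <- xml_find_all(doc, "//book")

# Hypothetical helper: text of the first child named `tag` in each node
child_text <- function(nodes, tag) xml_text(xml_find_first(nodes, tag))

# One row per <book>, one column per child label
books <- data.frame(
  title  = child_text(book_nodes, "title"),
  author = child_text(book_nodes, "author"),
  year   = child_text(book_nodes, "year"),
  stringsAsFactors = FALSE
)
print(books)
```

When applied to a node set, xml_find_first returns a result of the same length as its input, so a book without the requested child yields NA rather than shifting the remaining rows.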

Conclusion

Extracting nodes with the same label in R is straightforward with the xml2 package. Using functions such as read_xml, xml_find_all, and xml_find_first, we can parse XML documents and access the data contained within specific nodes. Handling multiple nodes involves looping through the extracted nodes and processing each one as needed. By understanding the structure of the XML document and using the appropriate functions, we can efficiently extract and manipulate XML data in R.

syam1270
