
How to Extract Nodes with the Same Label in R

Last Updated : 08 Aug, 2024

XML (eXtensible Markup Language) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. An XML document is structured as nested elements called nodes. These nodes can contain text, attributes, and other nested nodes. In this article, we will discuss how to extract nodes with the same label in the R programming language.

What is a Node?

A node is an element within the document structure; it may carry attributes and contain text and other nested elements. For example, in HTML, <div>, <p>, and <a> are nodes. Consider the following XML structure:

<books>
<book>
<title>C Programming</title>
<author>Author A</author>
</book>
<book>
<title>Java</title>
<author>Author B</author>
</book>
</books>

In this example, <books> is the root node, and it can contain multiple <book> nodes. Each <book> node contains a <title> node and an <author> node.
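The structure described above can be checked directly in R. The following is a minimal sketch, assuming the xml2 package is installed; it passes the sample document to read_xml as a literal string rather than a file, and the variable names (xml_string, root_name, and so on) are chosen here only for illustration.

```r
library(xml2)

# The sample document from above, supplied as a string instead of a file
xml_string <- "<books>
  <book><title>C Programming</title><author>Author A</author></book>
  <book><title>Java</title><author>Author B</author></book>
</books>"

doc <- read_xml(xml_string)

# Name of the root node
root_name <- xml_name(doc)

# Names of the root's direct children (one entry per <book>)
child_names <- xml_name(xml_children(doc))

# Names of the nodes nested inside the first <book>
first_book_fields <- xml_name(xml_children(xml_children(doc)[[1]]))

print(root_name)          # "books"
print(child_names)        # "book" "book"
print(first_book_fields)  # "title" "author"
```

Both <book> children share the same label, which is exactly the situation the rest of this article deals with.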

Node Extraction Using the xml2 Package

The xml2 package in R is a powerful tool for parsing XML documents and extracting data from them.

Step 1: Install and load the xml2 package

Install the package once with install.packages, then load it with library in each R session.

install.packages("xml2")

library(xml2)

To work with XML nodes in R, first load the XML file and parse it into an R object:

# Load and parse the XML file
xml_file <- read_xml("path_to_your_file.xml")

To extract nodes with a specific label, use the xml_find_all function from the xml2 package. It returns all nodes that match a given XPath expression. The label can be the tag name itself (for example, //book) or the value of an attribute (for example, //item[@label='A']).

R
# Load the xml2 package (install it first if necessary)
library(xml2)

# Path to your XML file
file_path <- "C:\\Users\\GFG19565\\Downloads\\sa1.xml"

# Read and parse the XML file
xml_doc <- read_xml(file_path)

# Print the XML content (for debugging purposes)
print(xml_doc)

# Extract all <item> nodes whose label attribute equals "A"
nodes <- xml_find_all(xml_doc, "//item[@label='A']")

# Display the content of these nodes
node_texts <- xml_text(nodes)
print(node_texts)

Output (the exact result depends on the contents of sa1.xml, which is not shown here):

<book><title>Python</title><author>Author C</author></book>
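Since the input file for the snippet above is not shown, here is a small self-contained sketch of the same attribute-based XPath. The <items> document is hypothetical and invented only for this illustration; xml_find_all, xml_text, and xml_attr are xml2 functions.

```r
library(xml2)

# Hypothetical document: <item> elements that carry a label attribute
doc <- read_xml('<items>
  <item label="A">first</item>
  <item label="B">second</item>
  <item label="A">third</item>
</items>')

# All <item> nodes, regardless of their attribute
all_items <- xml_find_all(doc, "//item")

# Only the <item> nodes whose label attribute equals "A"
items_a <- xml_find_all(doc, "//item[@label='A']")

# xml_attr reads an attribute; xml_text returns text content without tags
labels_all <- xml_attr(all_items, "label")
texts_a <- xml_text(items_a)

print(labels_all)  # "A" "B" "A"
print(texts_a)     # "first" "third"
```

Note that xml_text returns only the text inside the matching nodes; printing the node set itself (print(items_a)) is what shows the markup.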

Handling Multiple Nodes

When a document has multiple nodes with the same label, we may want to extract specific information from each one. We can loop through the extracted nodes and access their child nodes or attributes.

Step 1: Create the books.xml File

Create a simple XML file that lists several books and save it as books.xml.

XML
<books>
  <book>
    <title>Learning Python</title>
    <author>Mark Lutz</author>
    <year>2013</year>
    <publisher>O'Reilly Media</publisher>
  </book>
  <book>
    <title>Effective Java</title>
    <author>Joshua Bloch</author>
    <year>2018</year>
    <publisher>Addison-Wesley</publisher>
  </book>
  <book>
    <title>Clean Code</title>
    <author>Robert C. Martin</author>
    <year>2008</year>
    <publisher>Prentice Hall</publisher>
  </book>
  <book>
    <title>The Pragmatic Programmer</title>
    <author>Andrew Hunt, David Thomas</author>
    <year>1999</year>
    <publisher>Addison-Wesley</publisher>
  </book>

</books>

Step 2: R Code Implementation

The code below extracts all nodes with the same label (<book>) and prints the details of each one.

R
library(xml2)

# Load and parse the XML file
xml_file <- read_xml("C:/Users/Syam/OneDrive/Desktop/PHP Projects/books.xml")

# Extract all <book> nodes
book_nodes <- xml_find_all(xml_file, "//book")

# Print the extracted nodes
print(book_nodes)

# Loop through each <book> node and extract information
for (book in book_nodes) {
  title <- xml_text(xml_find_first(book, "title"))
  author <- xml_text(xml_find_first(book, "author"))
  year <- xml_text(xml_find_first(book, "year"))
  publisher <- xml_text(xml_find_first(book, "publisher"))
  
  cat("Title:", title, "\n")
  cat("Author:", author, "\n")
  cat("Year:", year, "\n")
  cat("Publisher:", publisher, "\n\n")
}

Output:

{xml_nodeset (4)}
[1] <book>\n <title>Learning Python</title>\n <author>Mark Lutz</author>\n <year ...
[2] <book>\n <title>Effective Java</title>\n <author>Joshua Bloch</author>\n <ye ...
[3] <book>\n <title>Clean Code</title>\n <author>Robert C. Martin</author>\n <ye ...
[4] <book>\n <title>The Pragmatic Programmer</title>\n <author>Andrew Hunt, David ...

Title: Learning Python
Author: Mark Lutz
Year: 2013
Publisher: O'Reilly Media

Title: Effective Java
Author: Joshua Bloch
Year: 2018
Publisher: Addison-Wesley

Title: Clean Code
Author: Robert C. Martin
Year: 2008
Publisher: Prentice Hall

Title: The Pragmatic Programmer
Author: Andrew Hunt, David Thomas
Year: 1999
Publisher: Addison-Wesley

In this example, xml_find_first finds the first occurrence of the named child node within each <book> node, and xml_text extracts the text content of that node.
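As an alternative to the explicit loop, xml_find_first can also be applied to the whole node set at once, which makes it easy to collect the fields into a data frame. The sketch below is one possible approach, not the only one: it uses a shortened copy of the books data in which the second book's <year> has been removed on purpose to show how a missing child is handled, and child_text is a hypothetical helper name introduced here.

```r
library(xml2)

# Shortened copy of the books data; the second <year> is omitted deliberately
doc <- read_xml("<books>
  <book><title>Learning Python</title><author>Mark Lutz</author><year>2013</year></book>
  <book><title>Effective Java</title><author>Joshua Bloch</author></book>
</books>")

book_nodes <- xml_find_all(doc, "//book")

# Hypothetical helper: text of the first matching child of every node.
# Applied to a node set, xml_find_first returns one result per node,
# and xml_text yields NA where the child is missing.
child_text <- function(nodes, tag) xml_text(xml_find_first(nodes, tag))

books_df <- data.frame(
  title  = child_text(book_nodes, "title"),
  author = child_text(book_nodes, "author"),
  year   = as.integer(child_text(book_nodes, "year")),
  stringsAsFactors = FALSE
)

print(books_df)
```

Each row corresponds to one <book> node, and the missing year appears as NA instead of shifting the remaining values out of alignment.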

Conclusion

Extracting nodes with the same label in R is straightforward with the xml2 package. Using functions such as read_xml, xml_find_all, and xml_find_first, we can parse XML documents and access the data contained in specific nodes. Handling multiple nodes involves looping through the extracted nodes and processing each one as needed. By understanding the structure of the XML document and choosing the appropriate functions, we can efficiently extract and manipulate XML data in R.

