How to Extract Text from XML File Using R

Last Updated : 21 Aug, 2024

XML (Extensible Markup Language) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. It is widely used to represent arbitrary data structures, such as those exchanged by web services. Extracting text from XML files is a common task in data analysis and web scraping. The process involves parsing the XML structure and retrieving the text content of specific nodes.

Using XML and xml2 Packages

R provides several packages for working with XML data, including the XML and xml2 packages.

  • XML Package: This package provides tools for parsing XML documents, navigating the XML tree structure, and extracting data from nodes.
  • xml2 Package: This package offers a simpler interface for reading XML and HTML documents and for selecting nodes with XPath expressions. It is also the parser underneath rvest, which is typically used for web scraping.

Installing the Packages

To use these packages, we need to install them first.

install.packages("XML")
install.packages("xml2")

Then, load them into the R session.

library(XML)
library(xml2)

Next, create a sample XML file, saved here as example.xml, from which the text will be extracted in R.

XML
<?xml version="1.0" encoding="UTF-8"?>
<library>
    <book>
        <title>The Catcher in the Rye</title>
        <author>J.D. Salinger</author>
        <year>1951</year>
    </book>
    <book>
        <title>To Kill a Mockingbird</title>
        <author>Harper Lee</author>
        <year>1960</year>
    </book>
    <book>
        <title>1984</title>
        <author>George Orwell</author>
        <year>1949</year>
    </book>
</library>
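
To follow along without creating the file by hand, the sample can also be written from R. This is a minimal sketch: placing the file in R's temporary directory is an assumption made for illustration, and the variable names `xml_lines` and `xml_path` are arbitrary.

```r
# Sample XML content, matching the listing above line by line
xml_lines <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<library>',
  '    <book>',
  '        <title>The Catcher in the Rye</title>',
  '        <author>J.D. Salinger</author>',
  '        <year>1951</year>',
  '    </book>',
  '    <book>',
  '        <title>To Kill a Mockingbird</title>',
  '        <author>Harper Lee</author>',
  '        <year>1960</year>',
  '    </book>',
  '    <book>',
  '        <title>1984</title>',
  '        <author>George Orwell</author>',
  '        <year>1949</year>',
  '    </book>',
  '</library>'
)

# Assumption: store the file in the session's temporary directory
xml_path <- file.path(tempdir(), "example.xml")
writeLines(xml_lines, xml_path)

# Confirm that the file was written
print(file.exists(xml_path))
```

The value of `xml_path` can then be passed to the parsing functions in place of a hard-coded path.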

Here is a step-by-step example of how to extract text from the XML file using the XML and xml2 packages.

1. Extract Text from XML File using the XML package

First, we will extract text from the XML file using the XML package.

R
# Load the XML package
library(XML)

# Parse the XML file (adjust the path to where example.xml is saved)
xml_file <- xmlParse("example.xml")

# Print the XML content
cat(saveXML(xml_file))

# List all 'book' nodes
book_nodes <- getNodeSet(xml_file, "//book")
print(length(book_nodes))

# Inspect first book node
if (length(book_nodes) > 0) {
  cat("First book node:\n")
  print(xmlToList(book_nodes[[1]]))
}

# Extract titles and authors
book_titles <- xpathSApply(xml_file, "//book/title", xmlValue)
book_authors <- xpathSApply(xml_file, "//book/author", xmlValue)

# Print extracted values
cat("Titles:\n")
print(book_titles)
cat("Authors:\n")
print(book_authors)

Output:

[1] 3

First book node:
$title
[1] "The Catcher in the Rye"

$author
[1] "J.D. Salinger"

$year
[1] "1951"

Titles:

[1] "The Catcher in the Rye" "To Kill a Mockingbird" "1984"

Authors:

[1] "J.D. Salinger" "Harper Lee" "George Orwell"
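
When every record has the same child elements, the XML package can also collect them into a table in one call. The sketch below uses `xmlToDataFrame()` from the XML package; the inline string `sample_xml` is an assumption made so the example runs on its own, and it is meant to mirror the content of example.xml.

```r
library(XML)

# Assumption: inline copy of the sample data so no file is needed
sample_xml <- '<library>
  <book><title>The Catcher in the Rye</title><author>J.D. Salinger</author><year>1951</year></book>
  <book><title>To Kill a Mockingbird</title><author>Harper Lee</author><year>1960</year></book>
  <book><title>1984</title><author>George Orwell</author><year>1949</year></book>
</library>'

# Parse the string instead of a file
doc <- xmlParse(sample_xml, asText = TRUE)

# One row per child of the root (<book>), one column per child element
books <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# All columns arrive as character; convert the year explicitly
books$year <- as.integer(books$year)

print(books)
```

This form suits flat, regular records like the ones here; for irregular documents, the per-node XPath extraction shown above gives more control.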

2. Extract Text from XML File Using the xml2 package

Next, we will extract text from the same XML file using the xml2 package.

R
# Load xml2 package
library(xml2)

# Read the XML file (adjust the path to where example.xml is saved)
xml_file <- read_xml("example.xml")

# Extract titles of the books
# (the native pipe |> is built into R 4.1 and later, so no extra pipe package is needed)
book_titles <- xml_file |>
  xml_find_all("//book/title") |>
  xml_text()

print(book_titles)

# Extract authors of the books
book_authors <- xml_file |>
  xml_find_all("//book/author") |>
  xml_text()

print(book_authors)

Output:

[1] "The Catcher in the Rye" "To Kill a Mockingbird"  "1984"                  

[1] "J.D. Salinger" "Harper Lee" "George Orwell"
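
Extracting each field separately works, but it is often handier to keep the fields of one book together. The following sketch, which is one possible approach rather than the only one, selects the `book` nodes first and then looks up each child relative to its book with `xml_find_first()`. The inline string `sample_xml` and the name `book_table` are assumptions for illustration.

```r
library(xml2)

# Assumption: inline copy of the sample data so no file is needed
sample_xml <- '<library>
  <book><title>The Catcher in the Rye</title><author>J.D. Salinger</author><year>1951</year></book>
  <book><title>To Kill a Mockingbird</title><author>Harper Lee</author><year>1960</year></book>
  <book><title>1984</title><author>George Orwell</author><year>1949</year></book>
</library>'

# read_xml() accepts literal XML as well as a path
doc <- read_xml(sample_xml)

# Select every <book> node once
books <- xml_find_all(doc, "//book")

# Look up each field relative to its own <book>, so rows stay aligned
book_table <- data.frame(
  title  = xml_text(xml_find_first(books, "./title")),
  author = xml_text(xml_find_first(books, "./author")),
  year   = xml_integer(xml_find_first(books, "./year")),
  stringsAsFactors = FALSE
)

print(book_table)
```

Because `xml_find_first()` returns exactly one result per book, a book with a missing element yields `NA` instead of silently shifting the remaining values.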

Handling the Text Data

Once the text has been extracted from the XML file, it may need additional processing to clean and format it. Common tasks include:

  • Removing Whitespace: Remove leading and trailing whitespace from the extracted text.
  • Replacing Special Characters: Replace or remove special characters that may be present in the text.
  • Splitting the Text: Split the text into individual words or sentences for further analysis.

R
# Example text data
text_data <- c("  Example text 1 ", "Example text 2\n", "Example & text 3")

# Remove leading and trailing whitespace
clean_text <- trimws(text_data)
print(clean_text)

# Replace special characters
clean_text <- gsub("&", "and", clean_text)
print(clean_text)

# Split text into words
words <- strsplit(clean_text, " ")
print(words)

Output:

[1] "Example text 1"   "Example text 2"   "Example & text 3"

[1] "Example text 1"     "Example text 2"     "Example and text 3"

[[1]]
[1] "Example" "text" "1"

[[2]]
[1] "Example" "text" "2"

[[3]]
[1] "Example" "and" "text" "3"
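
Text taken from indented XML often contains runs of spaces, tabs, or newlines inside a value, and splitting on a single space then produces empty strings. A minimal base R sketch that collapses such runs before splitting is shown below; the sample strings are made up for illustration.

```r
# Made-up sample values with irregular internal whitespace
text_data <- c("  Example   text 1 ", "Example text 2\n", "Example\t& text 3")

# Collapse every run of whitespace to one space, then trim both ends
normalized <- trimws(gsub("\\s+", " ", text_data))
print(normalized)

# Split on runs of whitespace rather than on a single space
words <- strsplit(normalized, "\\s+")

# Number of words in each element
print(lengths(words))
```

Splitting on the pattern `\\s+` rather than `" "` keeps the result free of empty strings even if the text was not normalized beforehand.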

Conclusion

Extracting text from XML files is a straightforward process with the help of the XML and xml2 packages. By parsing the XML structure and using the appropriate functions to navigate it, we can efficiently retrieve the desired text content. Post-extraction text handling ensures that the data is clean and ready for further analysis. These steps are useful for data scientists and analysts working with XML data in applications such as web scraping, data integration, and data processing.

