How to Read a Large JSON File in R

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate. JSON files are often used to transmit data between a server and a web application, and they can be quite large.

In this article, we'll cover the basics of using read_json() and split() to read large JSON files in R. We'll also explore some techniques for improving performance and reducing memory usage. Whether you're a seasoned R programmer or a beginner, this article will give you the knowledge you need to read large JSON files in R with confidence.

Read Large JSON files in R using read_json()

read_json() is a function from the jsonlite package that reads a JSON file and parses it into an R object. Note that it parses the whole file in one pass, so the parsed data must fit in memory. For files that are too large for that, jsonlite also provides stream_in(), which reads newline-delimited JSON in batches and is used later in this article.

Install the jsonlite library and load it

jsonlite is one of the most popular R packages for working with JSON. It provides a simple and efficient way to parse JSON data and convert it into an R object. Install it once with install.packages() and then load it with library():

install.packages("jsonlite")
library(jsonlite)

Creating Random Dataset

Here we create our own dataset of 1 million rows. You can generate your own in the same way, or use any large JSON dataset you already have.

library(jsonlite)

# generate random id
generate_id <- function() paste0(sample(c(letters, 
                                LETTERS, 0:9), 10, 
                                replace=TRUE), 
                                 collapse="")

# real first names of people
first_names <- c("John", "Jane", "Michael",
                 "Emily", "William", "Ashley", 
                 "David", "Jessica", "Andrew", 
                 "Jennifer", 
                 "Matthew", "Sarah", "Daniel", 
                 "Amanda", "Christopher", "Elizabeth", 
                 "Nicholas", "Megan", "Robert", 
                 "Lauren", "Joseph", "Ava", "Jacob", 
                 "Sophia", "Jonathan", "Natalie", "Ryan",
                 "Madison", "Adam", "Chloe")

# real last names of people
last_names <- c("Smith", "Johnson", "Williams", "Jones", 
                "Brown", "Davis", "Miller", "Wilson", 
                "Moore", "Taylor", 
                "Anderson", "Thomas", "Jackson", "White",
                "Harris", "Martin", "Thompson", "Garcia", 
                "Martinez", "Robinson", 
                "Clark", "Rodriguez", "Lewis", "Lee", "Walker",
                "Hall", "Allen", "King", "Wright", "Scott")

# education qualifications
qualifications <- c("Primary Education", "Secondary Education", 
                    "High School", "Undergraduate", "Postgraduate")

# create a data frame
df <- data.frame(ID = sapply(1:1000000, 
                       function(i) generate_id()),
                 First_Name = sample(first_names,
                            1000000, replace = TRUE),
                 Last_Name = sample(last_names, 
                             1000000, replace = TRUE),
                 Age = sample(18:30, 1000000, 
                              replace = TRUE),
                 Highest_qualification = 
                 sample(qualifications, 1000000, 
                        replace = TRUE),
                 stringsAsFactors = FALSE)

# write the data frame to a JSON file
write_json(df, "people.json")
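Calling generate_id() one million times through sapply() can be slow. As a hedged alternative, the sketch below builds all IDs in one vectorized pass. The helper name generate_ids and its len parameter are hypothetical and not part of the original code.

```r
# Hypothetical vectorized helper: builds n random IDs of `len` characters each
generate_ids <- function(n, len = 10) {
  chars <- c(letters, LETTERS, 0:9)
  # One character vector of length n per ID position
  cols <- replicate(len, sample(chars, n, replace = TRUE), simplify = FALSE)
  # paste0() concatenates the positions element-wise, giving n strings
  do.call(paste0, cols)
}

ids <- generate_ids(1000)
head(ids, 3)
```

The result could be passed directly as the ID column, for example ID = generate_ids(1000000).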

You can check the size of the file, in bytes, using the following code. Because the data is random, the exact value will vary slightly between runs.

file.info("people.json")$size

Output:

113428352
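A raw byte count is hard to read at a glance. The sketch below wraps the same file.info() call in a small helper that reports mebibytes. The name file_size_mb is hypothetical, and the example runs against a temporary file so that it works without people.json.

```r
# Hypothetical helper: file size in mebibytes (1 MiB = 1024^2 bytes)
file_size_mb <- function(path) {
  file.info(path)$size / 1024^2
}

# Demonstrate on a temporary file holding exactly 2 MiB of zero bytes
tmp <- tempfile()
writeBin(raw(2 * 1024^2), tmp)
file_size_mb(tmp)
```

Applied to the output shown above, 113428352 bytes is roughly 108 MiB.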

Read the JSON file into R

The read_json() function parses the JSON file and converts it into an R object. By default it returns a nested list; if you pass simplifyVector = TRUE, an array of records is simplified into a data frame. Once you have the data in an R object, you can use all the standard R functions and packages to manipulate and analyze it.

For example, to read the people.json file created above from your working directory, you would use the following code:

# simplifyVector = TRUE returns a data frame instead of a nested list
data <- jsonlite::read_json("people.json", simplifyVector = TRUE)
head(data, 3)
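To make the difference between the two return types concrete, here is a minimal sketch on a tiny file. The three-row data frame and the temporary file are made up for illustration.

```r
library(jsonlite)

# Hypothetical three-row data frame written to a temporary JSON file
small <- data.frame(id = 1:3, name = c("a", "b", "c"),
                    stringsAsFactors = FALSE)
tmp <- tempfile(fileext = ".json")
write_json(small, tmp)

# Default behaviour: a nested list with one element per record
as_list <- read_json(tmp)

# simplifyVector = TRUE: the records are collapsed into a data frame
as_df <- read_json(tmp, simplifyVector = TRUE)

class(as_list)
class(as_df)
```

For tabular data such as people.json, the data frame form is usually the more convenient one to work with.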

Split Large JSON files in R using split()

split() is a base R function that divides a vector or data frame into groups defined by a grouping vector. It works on data that is already in memory, not on files directly, so the file has to be read first. Once the data is split into smaller pieces, you can write each piece to its own file, process the pieces separately, and then combine the results.

This example shows how to use split() to break a large JSON file into smaller ones. It assumes a large dataset of 1 million rows, like the one generated above, saved to a JSON file that you want to read in R.
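As a small illustration of how split() behaves on a data frame, the sketch below cuts a made-up seven-row data frame into consecutive chunks of three rows. The names toy and groups are hypothetical.

```r
# Hypothetical toy data frame with 7 rows
toy <- data.frame(id = 1:7, value = letters[1:7],
                  stringsAsFactors = FALSE)
chunk_size <- 3

# Group label per row: rows 1-3 get 1, rows 4-6 get 2, row 7 gets 3
groups <- ceiling(seq_len(nrow(toy)) / chunk_size)

# split() returns a list with one data frame per group
parts <- split(toy, groups)
sapply(parts, nrow)
```

The last chunk is simply shorter when the row count is not a multiple of the chunk size.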

Installing and Loading the Required Package

split() is part of base R, so it needs no installation. The streaming and JSON conversion functions used below (stream_in() and toJSON()) come from the jsonlite package. If you have not installed it yet, install and load it using the following code:

install.packages("jsonlite")
library(jsonlite)

Determine the Number of Rows in the File

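Next, specify the path of the large JSON file and read it with stream_in(). Note that stream_in() expects newline-delimited JSON (NDJSON), with one record per line, and that without a handler it returns the complete data as a data frame; pagesize only controls how many lines are parsed per batch. To work out how many chunks are needed, divide the number of rows by the chunk size and use the base R ceiling() function to round up to the nearest integer.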

file_path <- "S:\\data.json"

# Expected number of rows in each chunk
chunk_size <- 100000 

# Read the NDJSON input file, parsing chunk_size lines per batch
data_stream <- stream_in(file(file_path), 
                         simplifyDataFrame = TRUE, 
                         pagesize = chunk_size)
n_rows <- nrow(data_stream)
n_chunks <- ceiling(n_rows / chunk_size)
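Because stream_in() reads one JSON record per line, the input has to be in NDJSON form. The sketch below, built on a made-up 25-row data frame and a temporary file, uses jsonlite's stream_out() to produce such a file and then repeats the row and chunk arithmetic on it.

```r
library(jsonlite)

# Hypothetical toy data standing in for the large dataset
toy <- data.frame(id = 1:25, value = rnorm(25))
ndjson_path <- tempfile(fileext = ".ndjson")

# stream_out() writes one JSON record per line (NDJSON)
stream_out(toy, file(ndjson_path), verbose = FALSE)

chunk_size <- 10
loaded <- stream_in(file(ndjson_path), pagesize = chunk_size,
                    verbose = FALSE)

n_rows <- nrow(loaded)                    # 25 rows
n_chunks <- ceiling(n_rows / chunk_size)  # 25 / 10 rounds up to 3
n_chunks
```

A file written with write_json() is a single JSON array rather than NDJSON, so it is better read with read_json() or re-exported with stream_out() first.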

Split the Large JSON File

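Next, use the split() function to divide the data frame into consecutive chunks of chunk_size rows. The grouping vector assigns rows 1 to chunk_size to part 1, the next chunk_size rows to part 2, and so on.

# split data into parts of chunk_size consecutive rows
parts <- split(data_stream, ceiling(seq_len(n_rows) / chunk_size))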

Write each part in a Separate File

Finally, loop over the parts, convert each one back to JSON with toJSON(), and write it to its own file. In this example, the 1 million rows are split into 10 smaller files, each containing 100,000 rows.

for (i in 1:n_chunks) {
 write(toJSON(parts[[i]]), paste0("part_", i, ".json"))
}
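Once the parts are on disk, they can be read back one at a time and, if needed, recombined. The following is a minimal sketch of that round trip on a made-up seven-row data frame, using a temporary directory and hypothetical names (toy, out_dir, restored).

```r
library(jsonlite)

# Hypothetical toy data split into chunks of 3 rows
toy <- data.frame(id = 1:7, value = letters[1:7],
                  stringsAsFactors = FALSE)
parts <- split(toy, ceiling(seq_len(nrow(toy)) / 3))

# Write each part to its own file in a temporary directory
out_dir <- tempfile("parts_")
dir.create(out_dir)
paths <- file.path(out_dir, paste0("part_", seq_along(parts), ".json"))
for (i in seq_along(parts)) {
  write_json(parts[[i]], paths[i])
}

# Read every part back as a data frame and stack them in order
restored <- do.call(rbind, lapply(paths, fromJSON))
rownames(restored) <- NULL
nrow(restored)
```

In practice you would usually process each part inside the lapply() call and combine only the (smaller) results.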

Complete Code

With these simple steps, you can split a large JSON file in R into smaller files, making it easier to process the data in R. Whether you are working with large datasets or just want to organize your data more efficiently, this method can be a useful tool in your R programming arsenal.

# load the jsonlite package
library(jsonlite)
file_path <- "S:\\data.json"
chunk_size <- 100000 # Expected number of rows in each chunk

# Open the input file
data_stream <- stream_in(file(file_path), 
                         simplifyDataFrame = TRUE, 
                         pagesize = chunk_size)
n_rows <- nrow(data_stream)
n_chunks <- ceiling(n_rows / chunk_size)

# split data into parts of chunk_size consecutive rows
parts <- split(data_stream, ceiling(seq_len(n_rows) / chunk_size))

# write each part to a separate file
for (i in 1:n_chunks) {
  write(toJSON(parts[[i]]), paste0("S:\\part_", i, ".json"))
}
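The code above still loads the whole dataset into memory before splitting it. As a hedged alternative, stream_in() accepts a handler function that is called once per batch of pagesize lines, so each chunk can be written out as soon as it is parsed. The sketch below uses made-up data, temporary paths, and a hypothetical chunk_index counter.

```r
library(jsonlite)

# Hypothetical NDJSON input with 25 records
toy <- data.frame(id = 1:25, value = seq(0.5, 12.5, by = 0.5))
ndjson_path <- tempfile(fileext = ".ndjson")
stream_out(toy, file(ndjson_path), verbose = FALSE)

out_dir <- tempfile("chunks_")
dir.create(out_dir)
chunk_index <- 0

# The handler receives one data frame per batch of `pagesize` lines
stream_in(file(ndjson_path), handler = function(page) {
  chunk_index <<- chunk_index + 1
  write_json(page, file.path(out_dir,
                             paste0("part_", chunk_index, ".json")))
}, pagesize = 10, verbose = FALSE)

list.files(out_dir)
```

With this approach only one batch is held in memory at a time, which matters most when the input is larger than available RAM.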
