In this article, we are going to see how to convert a data frame to a JSON array using PySpark in Python.
In Apache Spark, a data frame is a distributed collection of data organized into named columns. It is similar to a spreadsheet or a SQL table, with rows and columns. You can use a data frame to store and manipulate tabular data in a distributed environment. DataFrames are designed to be expressive, efficient, and flexible, and they are a key component of Spark's SQL and structured data APIs.
What is JSON array?
A JSON (JavaScript Object Notation) array is a data structure that consists of an ordered list of values. It is often used to transmit data between a server and a web application, or between two different applications. JSON arrays are written in a syntax similar to that of JavaScript arrays, with square brackets containing a list of values separated by commas.
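For illustration, a small JSON array of objects (with arbitrary sample values) looks like this:
[
  {"id": 1, "name": "Alice"},
  {"id": 2, "name": "Bob"}
]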
Methods to convert a DataFrame to a JSON array in Pyspark:
- Using the toJSON() method
- Using the toPandas() method
- Using the write.json() method
Method 1: Using the toJSON() method
The toJSON() method in PySpark converts each row of a Spark data frame into a JSON string. It returns an RDD of strings, one per row, so calling collect() on the result gives a Python list of JSON strings, i.e., a JSON array of records.
Stepwise implementation:
Step 1: First of all, import the required class, i.e., SparkSession. The SparkSession class is used to create the session.
from pyspark.sql import SparkSession
Step 2: Create a spark session using the getOrCreate() function.
spark = SparkSession.builder.appName("MyApp").getOrCreate()
Step 3: Create a data frame with sample data.
df = spark.createDataFrame([(1, "Alice", 10), (2, "Bob", 20), (3, "Charlie", 30)],
                           ["id", "name", "age"])
Step 4: Print the data frame.
df.show()
Step 5: Use the toJSON() method together with collect() to convert the data frame to a JSON array.
json_array = df.toJSON().collect()
Step 6: Finally, print the JSON array.
print("JSON array:",json_array)
Example:
In this example, we have created a data frame with three columns, id, name, and age, and converted it to a JSON array using the toJSON() method. In the output, both the data frame and the JSON array are printed.
Python3
# Importing SparkSession
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()
# Create a DataFrame
df = spark.createDataFrame([(1, "Alice", 10),
(2, "Bob", 20),
(3, "Charlie", 30)],
["id", "name", "age"])
print("Dataframe: ")
df.show()
# Convert the DataFrame to a JSON array
json_array = df.toJSON().collect()
# Print the JSON array
print("JSON array:",json_array)
Output:
Dataframe:
+---+-------+---+
| id| name|age|
+---+-------+---+
| 1| Alice| 10|
| 2| Bob| 20|
| 3|Charlie| 30|
+---+-------+---+
JSON array:
[
'{"id":1,"name":"Alice","age":10}',
'{"id":2,"name":"Bob","age":20}',
'{"id":3,"name":"Charlie","age":30}'
]
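Note that toJSON().collect() returns a Python list in which each element is the JSON string for one row. If you need the rows as Python dictionaries instead, you can parse them with the standard json module; a minimal sketch, assuming json_array from the code above:
import json

# Parse each JSON string into a Python dict
rows = [json.loads(s) for s in json_array]
print(rows[0]["name"])  # 'Alice'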
Method 2: Using the toPandas() method
In this method, we will first convert the Spark data frame, which has three columns id, name, and age, to a pandas data frame using the toPandas() method, and then convert it to a JSON string using the to_json() method.
Stepwise implementation:
Step 1: First of all, import the required class, i.e., SparkSession. The SparkSession class is used to create the session.
from pyspark.sql import SparkSession
Step 2: Create a spark session using the getOrCreate() function.
spark = SparkSession.builder.appName("MyApp").getOrCreate()
Step 3: Create a data frame with sample data.
df = spark.createDataFrame([(1, "Alice", 10), (2, "Bob", 20), (3, "Charlie", 30)],
                           ["id", "name", "age"])
Step 4: Print the data frame.
df.show()
Step 5: Convert the Spark data frame to a pandas data frame.
pandas_df = df.toPandas()
Step 6: Convert the pandas data frame to a JSON array (a single string) using the to_json() method.
json_data = pandas_df.to_json(orient='records')
Step 7: Finally, print the JSON array.
print("JSON array:", json_data)
Example:
Python3
# Importing SparkSession
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()
# Create a DataFrame
df = spark.createDataFrame([(1, "Joseph", 10),
(2, "Jack", 20),
(3, "Elon", 30)],
["id", "name", "age"])
print("Dataframe: ")
df.show()
# Convert dataframe to a Pandas dataframe
pandas_df = df.toPandas()
# Convert Pandas dataframe to a JSON string
json_data = pandas_df.to_json(orient='records')
# Print the JSON array
print("JSON array:", json_data)
Output:
Dataframe:
+---+-------+---+
| id| name|age|
+---+-------+---+
| 1| Joseph| 10|
| 2| Jack| 20|
| 3| Elon| 30|
+---+-------+---+
JSON array: [{"id":1,"name":"Joseph","age":10},{"id":2,"name":"Jack","age":20},{"id":3,"name":"Elon","age":30}]
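Unlike toJSON().collect(), to_json(orient='records') returns one single string. If you need actual Python objects, you can parse the string with the standard json module; a minimal sketch, assuming json_data from the code above:
import json

# Parse the JSON string into a list of dicts
records = json.loads(json_data)
print(records[0]["name"])  # 'Joseph'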
Method 3: Using the write.json() method
In this method, we will use write.json() to write the data frame out as JSON. Note that this creates a directory called data.json that contains one JSON file per partition, plus metadata files such as _SUCCESS. If we want the JSON data in a single file, we can call coalesce(1) to merge the data frame into one partition before calling write.json().
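For reference, the data.json output directory typically looks something like this (the exact part-file names vary from run to run, with one part file per partition):
data.json/
├── _SUCCESS
├── part-00000-xxxxxxxx.json
└── part-00001-xxxxxxxx.json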
Stepwise implementation:
Step 1: First of all, import the required class, i.e., SparkSession. The SparkSession class is used to create the session.
from pyspark.sql import SparkSession
Step 2: Create a spark session using the getOrCreate() function.
spark = SparkSession.builder.appName("MyApp").getOrCreate()
Step 3: Create a data frame with sample data.
df = spark.createDataFrame([(1, "Alice", 10),
(2, "Bob", 20),
(3, "Charlie", 30)],
["id", "name", "age"])
Step 4: Use write.json() to write the data frame to a JSON directory.
df.write.json('data.json')
Step 5: Finally, coalesce the data frame to a single partition so that the output is written as one JSON file.
df.coalesce(1).write.json('data_merged.json')
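Note that write.json() raises an error if the target path already exists, so rerunning the code requires either deleting the directory first or passing a save mode; a minimal sketch:
# Overwrite the output directory if it already exists
df.coalesce(1).write.mode('overwrite').json('data_merged.json')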
Example:
In this example, we created a data frame with three columns, id, name, and age. The first write creates a directory named data.json containing one JSON file per partition; the coalesced write then stores all rows in a single JSON file inside the data_merged.json directory.
Python3
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()
# Create a DataFrame
df = spark.createDataFrame([(1, "Donald", 51),
(2, "Riya", 23),
(3, "Vani", 22)],
["id", "name", "age"])
# Write dataframe to a JSON file
df.write.json('data.json')
# Merge JSON files into a single file
df.coalesce(1).write.json('data_merged.json')
# Printing data frame
df.show()
Output:
Spark writes the data in JSON Lines format, so the file inside the data_merged.json directory contains one JSON object per line:
{"id":1,"name":"Donald","age":51}
{"id":2,"name":"Riya","age":23}
{"id":3,"name":"Vani","age":22}