PySpark - Parse a Column of JSON Strings
Last Updated: 26 Apr, 2025
In this article, we will discuss how to parse a column of JSON strings into separate columns. We will read JSON strings stored in a CSV file and convert them into multiple DataFrame columns using PySpark.
Example 1: Parse a Column of JSON Strings Using pyspark.sql.functions.from_json
To parse the JSON strings we'll use the from_json() SQL function, which parses a column containing a JSON string into a StructType with the specified schema. If a string is unparseable, it returns null.
The movie_input.csv file contains 15 records with movie details (title, rating, releaseYear and genre) stored as a JSON string. We want to read this file and parse each JSON string to extract the movie details into their own columns: title, rating, releaseYear and genre.
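The contents of movie_input.csv are not reproduced here; based on the columns used later in the example (an id column plus a movie column holding the JSON string), a row of the file would look roughly like this (hypothetical layout only, the values are placeholders):
id,movie
1,"{'title': '...', 'rating': '...', 'releaseYear': '...', 'genre': '...'}"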
from_json() takes two required parameters:
- col: the name of the column that contains the JSON string.
- schema: a StructType (or an ArrayType of StructType) to use when parsing the JSON column.
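As a quick illustration of that null behaviour, here is a minimal sketch with a made-up two-row DataFrame (the column names raw and parsed and the sample values are only for demonstration; they are not part of the movie example):
Python3
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

spark = SparkSession.builder.appName('from_json demo').getOrCreate()

# one well-formed and one malformed json string
demo_df = spark.createDataFrame(
    [("1", '{"title": "A"}'), ("2", "not valid json")],
    ["id", "raw"]
)
demo_schema = T.StructType([T.StructField("title", T.StringType(), True)])

# the malformed row does not raise an error; depending on the Spark version
# and parser mode it comes back as null (or as a struct of null fields)
demo_df.withColumn("parsed", F.from_json("raw", demo_schema)).show()
With that in mind, let's read the CSV file and look at the raw data first: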
Python3
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # create a SparkSession
    spark = SparkSession.builder.appName(
        'Parse a column of json strings').getOrCreate()

    # read the csv file containing the json string column
    df = spark.read.load('movie_input.csv', header=True, format="csv")
    df.show()
    df.printSchema()
Output:
Once we have read the data into a DataFrame, let's convert the JSON column into multiple columns using from_json(). As mentioned above, this function takes the column containing the JSON string and the JSON schema as arguments, so let's first create the schema that represents our data.
Python3
import pyspark.sql.types as T

# schema to represent our json data
schema = T.StructType(
    [
        T.StructField('title', T.StringType(), True),
        T.StructField('rating', T.StringType(), True),
        T.StructField('releaseYear', T.StringType(), True),
        T.StructField('genre', T.StringType(), True)
    ]
)
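As a side note, from_json() also accepts the schema as a DDL-formatted string, so the same schema could be written more compactly (an equivalent alternative; the rest of this example keeps the StructType version):
Python3
# DDL string equivalent of the StructType defined above; it can be
# passed to from_json() in place of the StructType object
schema_ddl = "title STRING, rating STRING, releaseYear STRING, genre STRING"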
Now, let's use the from_json() function, which returns a struct Column containing all the parsed JSON fields.
Python3
import pyspark.sql.functions as F

# parse the json string column into a struct column using the schema
mapped_df = df.withColumn("movie", F.from_json("movie", schema))
mapped_df.show(truncate=False)
mapped_df.printSchema()
Output:
And finally, we flatten the JSON struct using the select method. We could have selected the individual columns movie.title, movie.rating, and so on, but the cleaner approach is to use the * wildcard, which selects every field under the movie struct.
Python3
import pyspark.sql.functions as F

# flatten the movie struct into separate top-level columns
parsed_df = mapped_df.select(F.col('id'), F.col("movie.*"))
parsed_df.show(truncate=False)
parsed_df.printSchema()
Output:
Example 2: Parse a Column of JSON Strings Using to_json() and json_tuple()
In this approach, we use to_json() and json_tuple() in addition to the from_json() function used above.
- First, we map the JSON string column to a MapType column with the help of from_json() (as noted earlier, an unparseable string would simply come back as null).
- Then we convert the MapType column back into a properly formatted JSON string with to_json().
- Finally, we use json_tuple() to create a new column for each of the field names passed to it.
Python3
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

if __name__ == "__main__":
    spark = SparkSession.builder.appName(
        'Parse a column of json strings').getOrCreate()

    # sample data: an id column and a column of json strings
    df = spark.createDataFrame(
        [
            ["1", "{'color': 'red', 'value': '#f00'}"],
            ["2", "{'color': 'green', 'value': '#0f0'}"],
            ["3", "{'color': 'blue', 'value': '#00f'}"],
            ["4", "{'color': 'cyan', 'value': '#0ff'}"],
            ["5", "{'color': 'magenta', 'value': '#f0f'}"],
            ["6", "{'color': 'yellow', 'value': '#ff0'}"],
            ["7", "{'color': 'black', 'value': '#000'}"],
        ]
    ).toDF('id', 'colors')
    df.show(truncate=False)
    df.printSchema()

    # parse the json strings into a MapType column
    df = df.withColumn("colors",
                       F.from_json(df.colors,
                                   T.MapType(T.StringType(),
                                             T.StringType())))
    df.show(truncate=False)
    df.printSchema()

    # convert the MapType column back to a json string
    df = df.withColumn("colors", F.to_json(df.colors))
    df.show(truncate=False)
    df.printSchema()

    # extract the fields into their own columns
    df = df.select('id', F.json_tuple(F.col("colors"),
                                      "color", "value")
                   ).toDF('id', 'color', 'value')
    df.show(truncate=False)
    df.printSchema()
Output:
+---+-------------------------------------+
|id |colors |
+---+-------------------------------------+
|1 |{'color': 'red', 'value': '#f00'} |
|2 |{'color': 'green', 'value': '#0f0'} |
|3 |{'color': 'blue', 'value': '#00f'} |
|4 |{'color': 'cyan', 'value': '#0ff'} |
|5 |{'color': 'magenta', 'value': '#f0f'}|
|6 |{'color': 'yellow', 'value': '#ff0'} |
|7 |{'color': 'black', 'value': '#000'} |
+---+-------------------------------------+
root
|-- id: string (nullable = true)
|-- colors: string (nullable = true)
+---+---------------------------------+
|id |colors |
+---+---------------------------------+
|1 |{color -> red, value -> #f00} |
|2 |{color -> green, value -> #0f0} |
|3 |{color -> blue, value -> #00f} |
|4 |{color -> cyan, value -> #0ff} |
|5 |{color -> magenta, value -> #f0f}|
|6 |{color -> yellow, value -> #ff0} |
|7 |{color -> black, value -> #000} |
+---+---------------------------------+
root
|-- id: string (nullable = true)
|-- colors: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+---+----------------------------------+
|id |colors |
+---+----------------------------------+
|1 |{"color":"red","value":"#f00"} |
|2 |{"color":"green","value":"#0f0"} |
|3 |{"color":"blue","value":"#00f"} |
|4 |{"color":"cyan","value":"#0ff"} |
|5 |{"color":"magenta","value":"#f0f"}|
|6 |{"color":"yellow","value":"#ff0"} |
|7 |{"color":"black","value":"#000"} |
+---+----------------------------------+
root
|-- id: string (nullable = true)
|-- colors: string (nullable = true)
+---+-------+-----+
|id |color |value|
+---+-------+-----+
|1 |red |#f00 |
|2 |green |#0f0 |
|3 |blue |#00f |
|4 |cyan |#0ff |
|5 |magenta|#f0f |
|6 |yellow |#ff0 |
|7 |black |#000 |
+---+-------+-----+
root
|-- id: string (nullable = true)
|-- color: string (nullable = true)
|-- value: string (nullable = true)
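As a brief aside, once the colors column has been parsed into a MapType (the second step above), the fields can also be pulled out directly with getItem(), without the to_json()/json_tuple() round trip. A minimal sketch of that variant, assuming df is still the DataFrame produced right after the from_json() step:
Python3
# alternative: read the map keys directly instead of converting
# back to a json string and using json_tuple()
alt_df = df.select(
    'id',
    F.col('colors').getItem('color').alias('color'),
    F.col('colors').getItem('value').alias('value')
)
alt_df.show(truncate=False)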