PySpark - Apply custom schema to a DataFrame
Last Updated : 28 Apr, 2025
In this article, we are going to apply a custom schema to a data frame using Pyspark in Python.
A distributed collection of rows under named columns is known as a Pyspark data frame. Usually, the schema of a Pyspark data frame is inferred from the data itself, but Pyspark also lets you customize the schema according to your needs. This can be done easily by defining a new schema and loading it into the respective data frame. Read the article further to know about it in detail.
What is a Schema?
The structure of the data frame, which we can get by calling the printSchema() method on the data frame object, is known as the schema in Pyspark. Basically, the schema describes the structure of the data frame: the data type of each column and a boolean flag that indicates whether the column's values can be null or not. The schema can be defined using the StructType class, which is a collection of StructField objects that define the column name, column type, nullability, and metadata.
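For instance, here is a minimal sketch (assuming a local Spark session; the sample names and values are made up for illustration) of how the inferred schema of a small data frame can be inspected:
Python3
# Minimal sketch: inspecting the inferred schema of a small DataFrame.
# The sample rows below are hypothetical and only meant for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Let Pyspark infer the schema from the data itself
df = spark.createDataFrame([("Arun", 12), ("Meena", 13)], ["name", "age"])

# Print the inferred schema in tree form
df.printSchema()

# The same schema is available as a StructType object
print(df.schema)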
Methods to apply custom schema to a Pyspark DataFrame
- Applying custom schema by changing the name.
- Applying custom schema by changing the type.
- Applying custom schema by changing the metadata.
Method 1: Applying custom schema by changing the name
Whenever we create a data frame or load a CSV file, it comes with a predefined (inferred) schema. If we don't want that schema and want to change it according to our needs, we apply a custom schema. A custom schema primarily specifies two things for each column: 'column_name' and 'column_type'. In this method, we will see how to apply a customized schema to the data frame by changing the column names in the schema.
Syntax: StructType(StructField('column_name_1', column_type(), Boolean_indication))
Parameters:
- column_name_1, column_name_2: The column names given to the data frame while applying the custom schema.
- column_type: The data type assigned to each column while applying the custom schema.
- Boolean_indication: Takes 'True' or 'False' and defines whether the column can contain null values or not.
Example:
In this example, we have defined a customized schema with columns 'Student_Name' of StringType, 'Student_Age' of IntegerType, 'Student_Subject' of StringType, 'Student_Class' of IntegerType, and 'Student_Fees' of IntegerType. Then, we loaded the CSV file (link), whose default inferred columns are name, age, subject, class, and fees.
Finally, we applied the customized schema to that CSV file by changing the names and displayed the updated schema of the data frame.
Python3
# PySpark - Apply custom schema to a DataFrame by changing names

# Import the libraries SparkSession, StructType,
# StructField, StringType, IntegerType
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Define the structure (custom schema) for the data frame
schema = StructType([
    StructField('Student_Name', StringType(), True),
    StructField('Student_Age', IntegerType(), True),
    StructField('Student_Subject', StringType(), True),
    StructField('Student_Class', IntegerType(), True),
    StructField('Student_Fees', IntegerType(), True)
])

# Apply the custom schema while reading the CSV file
df = spark_session.read.format("csv") \
    .schema(schema) \
    .option("header", True) \
    .load("/content/student_data.csv")

# Display the updated schema
df.printSchema()
Output:
root
|-- Student_Name: string (nullable = true)
|-- Student_Age: integer (nullable = true)
|-- Student_Subject: string (nullable = true)
|-- Student_Class: integer (nullable = true)
|-- Student_Fees: integer (nullable = true)
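As a side note, the same kind of StructType schema can also be passed to createDataFrame() when the data comes from in-memory rows instead of a CSV file. The rows below are hypothetical sample values used only for illustration:
Python3
# A minimal sketch: the same StructType schema can also be passed to
# createDataFrame() when building a data frame from in-memory rows.
# The rows below are hypothetical sample values, not part of the CSV above.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark_session = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField('Student_Name', StringType(), True),
    StructField('Student_Age', IntegerType(), True)
])

rows = [('Ishita', 14), ('Vinayak', 15)]

# The custom column names and types come from the schema, not the data
df = spark_session.createDataFrame(rows, schema=schema)
df.printSchema()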
Method 2: Applying custom schema by changing the type
As you know, the custom schema has two fields, 'column_name' and 'column_type'. In the previous method, we saw how to change the column names in the schema of the data frame; in this method, we will see how to apply a customized schema to the data frame by changing the column types.
Example:
In this example, we have read the CSV file (link), which is basically a 5*5 dataset with the columns name, age, subject, class, and fees.
Then, we applied a custom schema by changing the type of the column 'fees' from integer to float using the cast function and printed the updated schema of the data frame.
Python3
# Apply custom schema to a DataFrame by changing a column type

# Import the library SparkSession
from pyspark.sql import SparkSession

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file with an inferred schema
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Apply the custom schema by changing the type of the 'fees' column
data_frame1 = data_frame.withColumn('fees',
                                    data_frame['fees'].cast('float'))

# Display the updated schema
data_frame1.printSchema()
Output:
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- subject: string (nullable = true)
|-- class: integer (nullable = true)
|-- fees: float (nullable = true)
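The cast() approach extends naturally to several columns at once. The sketch below assumes the same CSV file and its 'age' and 'fees' columns:
Python3
# A minimal sketch: casting more than one column in a single chain.
# The column names 'age' and 'fees' are assumed to exist in the CSV used above.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark_session = SparkSession.builder.getOrCreate()

data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Cast 'age' to long and 'fees' to double in one pass
data_frame2 = data_frame.withColumn('age', col('age').cast('long')) \
                        .withColumn('fees', col('fees').cast('double'))

data_frame2.printSchema()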
Method 3: Applying custom schema by changing the metadata
The custom schema usually has two fields, 'column_name' and 'column_type', but we can also define a third field, 'metadata'. The metadata is basically a small description of the column. In this method, we will see how to apply a customized schema with metadata to the data frame.
Example:
In this example, we have defined a customized schema with columns 'Student_Name' of StringType with metadata 'Name of the student', 'Student_Age' of IntegerType with metadata 'Age of the student', 'Student_Subject' of StringType with metadata 'Subject of the student', 'Student_Class' of IntegerType with metadata 'Class of the student', and 'Student_Fees' of IntegerType with metadata 'Fees of the student'. Then, we loaded the CSV file (link).
Finally, we applied the customized schema to that CSV file and displayed the schema of the data frame along with the metadata.
Python3
# PySpark - Apply custom schema to a DataFrame by changing metadata

# Import the libraries SparkSession, StructType,
# StructField, StringType, IntegerType
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Define the schema for the data frame, attaching metadata to each column
schema = StructType([
    StructField('Student_Name', StringType(), True,
                metadata={"desc": "Name of the student"}),
    StructField('Student_Age', IntegerType(), True,
                metadata={"desc": "Age of the student"}),
    StructField('Student_Subject', StringType(), True,
                metadata={"desc": "Subject of the student"}),
    StructField('Student_Class', IntegerType(), True,
                metadata={"desc": "Class of the student"}),
    StructField('Student_Fees', IntegerType(), True,
                metadata={"desc": "Fees of the student"})
])

# Apply the custom schema while reading the CSV file
df = spark_session.read.format("csv") \
    .schema(schema) \
    .option("header", True) \
    .load("/content/student_data.csv")

# Display the updated schema of the data frame
df.printSchema()

# Run a loop to display the metadata for each column
for i in range(len(df.columns)):
    desc = df.schema.fields[i].metadata["desc"]
    print('Column ', i + 1, ': ', desc)
Output:
root
|-- Student_Name: string (nullable = true)
|-- Student_Age: integer (nullable = true)
|-- Student_Subject: string (nullable = true)
|-- Student_Class: integer (nullable = true)
|-- Student_Fees: integer (nullable = true)
Column 1 : Name of the student
Column 2 : Age of the student
Column 3 : Subject of the student
Column 4 : Class of the student
Column 5 : Fees of the student
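Metadata can also be attached to a column of an existing data frame by re-creating the column with alias() and its metadata argument (newer Spark versions additionally offer DataFrame.withMetadata()). The sketch below assumes the same CSV file and its 'name' column; the description text is just an illustrative value:
Python3
# A minimal sketch: attaching metadata to an existing column with alias().
# The column 'name' is assumed to exist in the CSV used above.
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()

df = spark_session.read.csv('/content/student_data.csv',
                            sep=',', inferSchema=True, header=True)

# Re-create the 'name' column under the same name, attaching metadata to it
df2 = df.withColumn('name',
                    df['name'].alias('name',
                                     metadata={"desc": "Name of the student"}))

# Read the metadata back from the schema
print(df2.schema['name'].metadata["desc"])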