How to Change Column Type in a PySpark DataFrame?
Last Updated: 18 Jul, 2021
In this article, we are going to see how to change the column type of a PySpark DataFrame.
Creating a DataFrame for demonstration:
Python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName('SparkExamples').getOrCreate()

# Column names and sample rows
columns = ["Name", "Course_Name", "Duration_Months",
           "Course_Fees", "Start_Date", "Payment_Done"]
data = [
    ("Amit Pathak", "Python", 3, 10000, "02-07-2021", True),
    ("Shikhar Mishra", "Soft skills", 2, 8000, "07-10-2021", False),
    ("Shivani Suvarna", "Accounting", 6, 15000, "20-08-2021", True),
    ("Pooja Jain", "Data Science", 12, 60000, "02-12-2021", False),
]

# Create the DataFrame and display it
course_df = spark.createDataFrame(data).toDF(*columns)
course_df.show()
Output:
+---------------+------------+---------------+-----------+----------+------------+
|           Name| Course_Name|Duration_Months|Course_Fees|Start_Date|Payment_Done|
+---------------+------------+---------------+-----------+----------+------------+
|    Amit Pathak|      Python|              3|      10000|02-07-2021|        true|
| Shikhar Mishra| Soft skills|              2|       8000|07-10-2021|       false|
|Shivani Suvarna|  Accounting|              6|      15000|20-08-2021|        true|
|     Pooja Jain|Data Science|             12|      60000|02-12-2021|       false|
+---------------+------------+---------------+-----------+----------+------------+
Let's see the schema of the DataFrame:
Python
course_df.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- Course_Name: string (nullable = true)
|-- Duration_Months: long (nullable = true)
|-- Course_Fees: long (nullable = true)
|-- Start_Date: string (nullable = true)
|-- Payment_Done: boolean (nullable = true)
Method 1: Using DataFrame.withColumn()
DataFrame.withColumn(colName, col) returns a new DataFrame by adding a column or replacing an existing column of the same name.
We will use the Column.cast(dataType) method to cast a column to a different data type. The dataType parameter is the target type, given either as a DataType instance or as its string alias (for example, 'float').
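Since cast() accepts either a DataType instance or its string alias, 'float' and FloatType() are interchangeable. A minimal sketch of both forms, using the course_df created above:
Python
from pyspark.sql.types import FloatType

# Both expressions build the same cast; only the spelling of the type differs
fees_as_float_str = course_df["Course_Fees"].cast('float')
fees_as_float_obj = course_df["Course_Fees"].cast(FloatType())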
Example 1: Change the datatype of a single column.
Python
# Cast Course_Fees from long to float using withColumn()
course_df2 = course_df.withColumn("Course_Fees",
                                  course_df["Course_Fees"].cast('float'))
course_df2.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- Course_Name: string (nullable = true)
|-- Duration_Months: long (nullable = true)
|-- Course_Fees: float (nullable = true)
|-- Start_Date: string (nullable = true)
|-- Payment_Done: boolean (nullable = true)
In the above example, we can observe that the "Course_Fees" column's datatype changed from long to float.
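To verify a single column's type programmatically instead of reading printSchema() output, you can inspect DataFrame.dtypes, which returns (column name, type string) pairs. A quick check on the course_df2 above:
Python
# dtypes is a list of (name, type) tuples; turn it into a dict for lookup
print(dict(course_df2.dtypes)["Course_Fees"])  # prints: float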
Example 2: Change the datatype of multiple columns.
Python
from pyspark.sql.types import StringType, DateType, FloatType

# Chain withColumn() calls to cast three columns in one pass
course_df3 = course_df \
    .withColumn("Course_Fees",
                course_df["Course_Fees"].cast(FloatType())) \
    .withColumn("Payment_Done",
                course_df["Payment_Done"].cast(StringType())) \
    .withColumn("Start_Date",
                course_df["Start_Date"].cast(DateType()))
course_df3.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- Course_Name: string (nullable = true)
|-- Duration_Months: long (nullable = true)
|-- Course_Fees: float (nullable = true)
|-- Start_Date: date (nullable = true)
|-- Payment_Done: string (nullable = true)
In the above example, we changed the datatypes of the "Course_Fees", "Payment_Done", and "Start_Date" columns to float, string, and date respectively.
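One caveat: casting a string column to DateType relies on Spark's default yyyy-MM-dd format, so values like "02-07-2021" (dd-MM-yyyy, as in our data) typically come back as null rather than raising an error. When dates use a custom format, to_date() with an explicit pattern is the safer route. A sketch assuming the dd-MM-yyyy format used here:
Python
from pyspark.sql.functions import to_date

# Parse the dd-MM-yyyy strings explicitly instead of relying on a bare cast
course_df3b = course_df.withColumn(
    "Start_Date", to_date(course_df["Start_Date"], "dd-MM-yyyy")
)
course_df3b.printSchema()  # Start_Date: date (nullable = true)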
Method 2: Using DataFrame.select()
Here we will use the select() function, which returns a new DataFrame containing the given columns or column expressions.
Syntax: dataframe.select(*columns)
where dataframe is the input DataFrame and columns are the columns (or expressions, such as casts) to keep.
Example 1: Change column types back to the original schema.
Let us convert `course_df3` from the modified schema above back to the original column types.
Python
from pyspark.sql.types import StringType, BooleanType, IntegerType
# Rebuild the DataFrame, casting three columns back to their original types
course_df4 = course_df3.select(
    course_df3.Name,
    course_df3.Course_Name,
    course_df3.Duration_Months,
    course_df3.Course_Fees.cast(IntegerType()).alias('Course_Fees'),
    course_df3.Start_Date.cast(StringType()).alias('Start_Date'),
    course_df3.Payment_Done.cast(BooleanType()).alias('Payment_Done'),
)
course_df4.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- Course_Name: string (nullable = true)
|-- Duration_Months: long (nullable = true)
|-- Course_Fees: integer (nullable = true)
|-- Start_Date: string (nullable = true)
|-- Payment_Done: boolean (nullable = true)
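If you prefer SQL-style casts without leaving the DataFrame API, selectExpr() accepts the same conversions as expression strings. An equivalent sketch of the conversion above (course_df4b is just an illustrative name):
Python
course_df4b = course_df3.selectExpr(
    "Name",
    "Course_Name",
    "Duration_Months",
    "CAST(Course_Fees AS INT) AS Course_Fees",
    "CAST(Start_Date AS STRING) AS Start_Date",
    "CAST(Payment_Done AS BOOLEAN) AS Payment_Done",
)
course_df4b.printSchema()  # same schema as course_df4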
Example 2: Changing multiple columns to the same datatype.
Python
from pyspark.sql.types import StringType
# Cast every column to string with a list comprehension
course_df5 = course_df.select(
    [course_df[c].cast(StringType()).alias(c)
     for c in course_df.columns]
)
course_df5.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- Course_Name: string (nullable = true)
|-- Duration_Months: string (nullable = true)
|-- Course_Fees: string (nullable = true)
|-- Start_Date: string (nullable = true)
|-- Payment_Done: string (nullable = true)
Example 3: Changing multiple columns to different datatypes.
Let us use `course_df5`, in which every column is of type string, and change each column back to its appropriate type.
Python
from pyspark.sql.types import (
StringType, BooleanType, IntegerType, FloatType, DateType
)
# Map each column name to its target type
coltype_map = {
    "Name": StringType(),
    "Course_Name": StringType(),
    "Duration_Months": IntegerType(),
    "Course_Fees": FloatType(),
    "Start_Date": DateType(),
    "Payment_Done": BooleanType(),
}

# Look up each column's target type and cast it
course_df6 = course_df5.select(
    [course_df5[c].cast(coltype_map[c]).alias(c)
     for c in course_df5.columns]
)
course_df6.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- Course_Name: string (nullable = true)
|-- Duration_Months: integer (nullable = true)
|-- Course_Fees: float (nullable = true)
|-- Start_Date: date (nullable = true)
|-- Payment_Done: boolean (nullable = true)
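The same name-to-type mapping also works with a withColumn() loop, which can be handier when only a few columns change; on Spark 3.3+, DataFrame.withColumns() can apply a whole map of expressions in one call. A sketch reusing coltype_map from above (course_df6b is just an illustrative name):
Python
# Cast each mapped column in turn; unmapped columns pass through untouched
course_df6b = course_df5
for col_name, col_type in coltype_map.items():
    course_df6b = course_df6b.withColumn(
        col_name, course_df6b[col_name].cast(col_type)
    )
course_df6b.printSchema()  # same schema as course_df6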
Method 3: Using spark.sql()
Here we will use a SQL query to change the column types.
Syntax: spark.sql("SQL query")
Example: Using spark.sql()
Python
# Register the all-string DataFrame as a temporary view
course_df5.createOrReplaceTempView("course_view")

# Cast each column with SQL CAST expressions
course_df7 = spark.sql(
    '''
    SELECT
        Name,
        Course_Name,
        CAST(Duration_Months AS INT) AS Duration_Months,
        CAST(Course_Fees AS FLOAT) AS Course_Fees,
        CAST(Start_Date AS DATE) AS Start_Date,
        CAST(Payment_Done AS BOOLEAN) AS Payment_Done
    FROM course_view
    '''
)
course_df7.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- Course_Name: string (nullable = true)
|-- Duration_Months: integer (nullable = true)
|-- Course_Fees: float (nullable = true)
|-- Start_Date: date (nullable = true)
|-- Payment_Done: boolean (nullable = true)