How to Change Column Type in a PySpark DataFrame?
Last Updated: 18 Jul, 2021
In this article, we are going to see how to change the column type of a PySpark DataFrame.
Creating a DataFrame for demonstration:
Python
# Create a spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkExamples').getOrCreate()

# Create a spark dataframe
columns = ["Name", "Course_Name", "Duration_Months",
           "Course_Fees", "Start_Date", "Payment_Done"]
data = [
    ("Amit Pathak", "Python", 3, 10000, "02-07-2021", True),
    ("Shikhar Mishra", "Soft skills", 2, 8000, "07-10-2021", False),
    ("Shivani Suvarna", "Accounting", 6, 15000, "20-08-2021", True),
    ("Pooja Jain", "Data Science", 12, 60000, "02-12-2021", False),
]
course_df = spark.createDataFrame(data).toDF(*columns)

# View the dataframe
course_df.show()
Output:
+---------------+------------+---------------+-----------+----------+------------+
|           Name| Course_Name|Duration_Months|Course_Fees|Start_Date|Payment_Done|
+---------------+------------+---------------+-----------+----------+------------+
|    Amit Pathak|      Python|              3|      10000|02-07-2021|        true|
| Shikhar Mishra| Soft skills|              2|       8000|07-10-2021|       false|
|Shivani Suvarna|  Accounting|              6|      15000|20-08-2021|        true|
|     Pooja Jain|Data Science|             12|      60000|02-12-2021|       false|
+---------------+------------+---------------+-----------+----------+------------+
Let's see the schema of the DataFrame:
Python
# View the column datatypes
course_df.printSchema()
Output:
root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: long (nullable = true)
 |-- Course_Fees: long (nullable = true)
 |-- Start_Date: string (nullable = true)
 |-- Payment_Done: boolean (nullable = true)
Method 1: Using DataFrame.withColumn()
DataFrame.withColumn(colName, col) returns a new DataFrame by adding a column or replacing the existing column of the same name.
To change a column's type, we pass in the result of Column.cast(dataType), which casts the column to a different data type. Here, dataType is the target type, given either as a type name string (e.g. 'float') or as a pyspark.sql.types instance.
Example 1: Change the datatype of a single column.
Python
# Cast Course_Fees from long type to float type
course_df2 = course_df.withColumn("Course_Fees",
                                  course_df["Course_Fees"].cast('float'))
course_df2.printSchema()
Output:
root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: long (nullable = true)
 |-- Course_Fees: float (nullable = true)
 |-- Start_Date: string (nullable = true)
 |-- Payment_Done: boolean (nullable = true)
In the above example, we can observe that the "Course_Fees" column datatype changed from long to float.
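Besides printSchema(), the dtypes attribute offers a quick way to confirm a cast took effect; a minimal check on the frame above:
Python
# dtypes returns a list of (column name, type name) pairs,
# e.g. ('Course_Fees', 'float') after the cast above
print(course_df2.dtypes)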
Example 2: Change the datatype of multiple columns.
Python
# We can also make use of datatypes from
# pyspark.sql.types
from pyspark.sql.types import StringType, DateType, FloatType

course_df3 = course_df \
    .withColumn("Course_Fees",
                course_df["Course_Fees"].cast(FloatType())) \
    .withColumn("Payment_Done",
                course_df["Payment_Done"].cast(StringType())) \
    .withColumn("Start_Date",
                course_df["Start_Date"].cast(DateType()))

course_df3.printSchema()
Output:
root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: long (nullable = true)
 |-- Course_Fees: float (nullable = true)
 |-- Start_Date: date (nullable = true)
 |-- Payment_Done: string (nullable = true)
In the above example, we changed the datatypes of the columns "Course_Fees", "Payment_Done", and "Start_Date" to float, string, and date, respectively.
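One caveat: casting a string column to DateType only parses values in the ISO yyyy-MM-dd format, so the dd-MM-yyyy strings in this data will come out as null in recent Spark versions. A minimal sketch of the safer route, using to_date() with an explicit format (the name course_df3_dates is just for illustration):
Python
# to_date() parses strings with an explicit pattern instead of
# relying on cast(), which expects ISO yyyy-MM-dd input
from pyspark.sql.functions import to_date

course_df3_dates = course_df.withColumn(
    "Start_Date", to_date(course_df["Start_Date"], "dd-MM-yyyy")
)
course_df3_dates.printSchema()  # Start_Date: date, values preserved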
Method 2: Using DataFrame.select()
Here we will use the select() function, which returns a new DataFrame containing only the selected columns (or column expressions, such as casts).
Syntax: dataframe.select(columns)
where dataframe is the input DataFrame and columns are the columns or column expressions to select.
Example 1: Convert columns back to their original types.
Let us convert course_df3 from the above schema back to the original schema.
Python
from pyspark.sql.types import StringType, BooleanType, IntegerType

course_df4 = course_df3.select(
    course_df3.Name,
    course_df3.Course_Name,
    course_df3.Duration_Months,
    course_df3.Course_Fees.cast(IntegerType()).alias('Course_Fees'),
    course_df3.Start_Date.cast(StringType()).alias('Start_Date'),
    course_df3.Payment_Done.cast(BooleanType()).alias('Payment_Done'),
)
course_df4.printSchema()
Output:
root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: long (nullable = true)
 |-- Course_Fees: integer (nullable = true)
 |-- Start_Date: string (nullable = true)
 |-- Payment_Done: boolean (nullable = true)
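The same column-by-column cast can also be written with selectExpr(), which takes SQL expressions as strings (course_df4_expr is an illustrative name):
Python
# selectExpr() accepts SQL cast expressions; equivalent to the select() above
course_df4_expr = course_df3.selectExpr(
    "Name",
    "Course_Name",
    "Duration_Months",
    "CAST(Course_Fees AS INT) AS Course_Fees",
    "CAST(Start_Date AS STRING) AS Start_Date",
    "CAST(Payment_Done AS BOOLEAN) AS Payment_Done",
)
course_df4_expr.printSchema()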
Example 2: Change multiple columns to the same datatype.
Python
# Changing datatype of all the columns
# to string type
from pyspark.sql.types import StringType

course_df5 = course_df.select(
    [course_df[c].cast(StringType()).alias(c)
     for c in course_df.columns]
)
course_df5.printSchema()
Output:
root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: string (nullable = true)
 |-- Course_Fees: string (nullable = true)
 |-- Start_Date: string (nullable = true)
 |-- Payment_Done: string (nullable = true)
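cast() also accepts the type's name as a plain string, so the same loop works without the pyspark.sql.types import (course_df5_alt is an illustrative name):
Python
# Passing the type name as a string is equivalent to StringType()
course_df5_alt = course_df.select(
    [course_df[c].cast("string").alias(c) for c in course_df.columns]
)
course_df5_alt.printSchema()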
Example 3: Change multiple columns to different datatypes.
Let us use course_df5, in which every column is of type string, and cast each column to an appropriate type.
Python
from pyspark.sql.types import (
    StringType, BooleanType, IntegerType, FloatType, DateType
)

# Map each column to its target type
coltype_map = {
    "Name": StringType(),
    "Course_Name": StringType(),
    "Duration_Months": IntegerType(),
    "Course_Fees": FloatType(),
    "Start_Date": DateType(),
    "Payment_Done": BooleanType(),
}

# course_df5 has all the column types as string;
# cast each column to its mapped type
course_df6 = course_df5.select(
    [course_df5[c].cast(coltype_map[c]).alias(c)
     for c in course_df5.columns]
)
course_df6.printSchema()
Output:
root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: integer (nullable = true)
 |-- Course_Fees: float (nullable = true)
 |-- Start_Date: date (nullable = true)
 |-- Payment_Done: boolean (nullable = true)
Method 3: Using spark.sql()
Here we will use a SQL query to change the column types. The DataFrame must first be registered as a temporary view so that it can be queried.
Syntax: spark.sql("sql query")
Example: Using spark.sql()
Python
# course_df5 has all the column datatypes as string
course_df5.createOrReplaceTempView("course_view")

course_df7 = spark.sql('''
    SELECT
        Name,
        Course_Name,
        INT(Duration_Months),
        FLOAT(Course_Fees),
        DATE(Start_Date),
        BOOLEAN(Payment_Done)
    FROM course_view
''')
course_df7.printSchema()
Output:
root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: integer (nullable = true)
 |-- Course_Fees: float (nullable = true)
 |-- Start_Date: date (nullable = true)
 |-- Payment_Done: boolean (nullable = true)
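The INT(...), FLOAT(...), DATE(...), and BOOLEAN(...) calls above are Spark SQL shorthand; the standard CAST ... AS syntax gives the same result (course_df8 is an illustrative name):
Python
# Standard SQL CAST syntax, equivalent to the type-function shorthand above
course_df8 = spark.sql('''
    SELECT
        Name,
        Course_Name,
        CAST(Duration_Months AS INT) AS Duration_Months,
        CAST(Course_Fees AS FLOAT) AS Course_Fees,
        CAST(Start_Date AS DATE) AS Start_Date,
        CAST(Payment_Done AS BOOLEAN) AS Payment_Done
    FROM course_view
''')
course_df8.printSchema()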