How to add a new column to a PySpark DataFrame?
Last Updated: 13 Jan, 2022
In this article, we will discuss how to add a new column to PySpark Dataframe.
First, create a sample data frame for demonstration. It will be reused throughout this article to illustrate each approach.
Python3
import pyspark
from pyspark.sql import SparkSession

# create a Spark session
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# sample data and column names
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]
columns = ['ID', 'NAME', 'Company']

dataframe = spark.createDataFrame(data, columns)
dataframe.show()
Output:

Method 1: Add New Column With Constant Value
To add a new column with a constant value, call the withColumn() function and pass lit(value) as its second argument. The lit() function is available in the pyspark.sql.functions module.
Syntax:
dataframe.withColumn("column_name", lit(value))
where,
- dataframe is the pyspark input dataframe
- column_name is the new column to be added
- value is the constant value to be assigned to this column
Example:
In this example, we add a column named salary with a constant value of 34000 to the dataframe above, using the withColumn() function with lit() as its parameter.
Python3
import pyspark
from pyspark.sql.functions import lit
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]
columns = ['ID', 'NAME', 'Company']
dataframe = spark.createDataFrame(data, columns)

# add a constant salary column
dataframe.withColumn("salary", lit(34000)).show()
Output:

Method 2: Add Column Based on Another Column of DataFrame
Under this approach, the user can add a new column based on an existing column in the given dataframe.
Example 1: Using withColumn() method
In this example, the user derives the new column from an existing one by passing the existing column as the second argument of withColumn().
Syntax:
dataframe.withColumn("column_name", dataframe.existing_column)
where,
- dataframe is the input dataframe
- column_name is the new column
- existing_column is the column that already exists in the dataframe
In this example, we add a column named salary by multiplying the ID column by 2300 using the withColumn() method.
Python3
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]
columns = ['ID', 'NAME', 'Company']
dataframe = spark.createDataFrame(data, columns)

# derive salary from the existing ID column
dataframe.withColumn("salary", dataframe.ID * 2300).show()
Output:

Example 2: Using concat_ws()
In this example, the user concatenates two existing columns into a new column using concat_ws(), imported from the pyspark.sql.functions module.
Syntax:
dataframe.withColumn("column_name", concat_ws("Separator", "existing_column1", "existing_column2"))
where,
- dataframe is the input dataframe
- column_name is the new column name
- existing_column1 and existing_column2 are the two columns whose values are joined to form the new column's values
- Separator is the string placed between the values of the two columns
Example:
In this example, we add a column named Details built from the NAME and Company columns, separated by "-".
Python3
import pyspark
from pyspark.sql.functions import concat_ws
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]
columns = ['ID', 'NAME', 'Company']
dataframe = spark.createDataFrame(data, columns)

# concatenate NAME and Company with "-" into a new Details column
dataframe.withColumn("Details", concat_ws("-", "NAME", "Company")).show()
Output:

Method 3: Add Column When not Exists on DataFrame
In this method, a column is added only when it does not already exist, by combining an if condition on dataframe.columns with withColumn() and lit().
Syntax:
if 'column_name' not in dataframe.columns:
    dataframe.withColumn("column_name", lit(value))
where,
- dataframe.columns returns the list of column names
Example:
In this example, we add a salary column with the value 34000, guarded by an if condition, using withColumn() and lit().
Python3
import pyspark
from pyspark.sql.functions import lit
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]
columns = ['ID', 'NAME', 'Company']
dataframe = spark.createDataFrame(data, columns)

# add salary only if no column with that name exists yet
if 'salary' not in dataframe.columns:
    dataframe.withColumn("salary", lit(34000)).show()
Output:

Method 4: Add Column to DataFrame using select()
In this method, the user adds a column by passing lit() inside the select() function. Note that select() returns only the columns listed, so only the new column is displayed.
Syntax:
dataframe.select(lit(value).alias("column_name"))
where,
- dataframe is the input dataframe
- column_name is the new column
Example:
In this example, we add a salary column with a constant value of 34000 using the select() function with the lit() function as its parameter.
Python3
import pyspark
from pyspark.sql.functions import lit
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]
columns = ['ID', 'NAME', 'Company']
dataframe = spark.createDataFrame(data, columns)

# select only the new constant column
dataframe.select(lit(34000).alias("salary")).show()
Output:

Method 5: Add Column to DataFrame using SQL Expression
In this method, the user adds a column with a SQL expression passed to the sql() function. First, create a temporary view of the dataframe; then select from that view, defining the new column in the SQL expression.
Syntax:
dataframe.createOrReplaceTempView("name")
spark.sql("select 'value' as column_name from view")
where,
- dataframe is the input dataframe
- name is the temporary view name
- the sql() function takes a SQL expression as input and adds the column
- column_name is the new column name
- value is the column value
Example:
Add a new column named salary with the value 34000:
Python3
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]
columns = ['ID', 'NAME', 'Company']
dataframe = spark.createDataFrame(data, columns)

# register a temporary view, then add the column via a SQL expression
dataframe.createOrReplaceTempView("view")
spark.sql("select '34000' as salary from view").show()
Output:

Method 6: Add Column Value Based on Condition
In this method, the user uses the when() function together with withColumn() to check conditions and fill in the new column based on existing column values. Import when() from pyspark.sql.functions to add a column based on the given conditions.
Syntax:
dataframe.withColumn("column_name",
    when(dataframe.column_name condition1, lit("value1")).
    when(dataframe.column_name condition2, lit("value2")).
    ...
    when(dataframe.column_name conditionN, lit("valueN")).
    otherwise(lit("value")))
where,
- column_name is the new column name
- condition1 is the condition checked by when(); value1 is assigned via lit() when it holds
- otherwise() supplies the value used when no condition is satisfied
Example:
In this example, we add a new column named salary using when() with withColumn(): 34000 when the name is sravan, 31000 when the name is ojsawi or bobby, and 78000 otherwise.
Python3
import pyspark
from pyspark.sql.functions import when, lit
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]
columns = ['ID', 'NAME', 'Company']
dataframe = spark.createDataFrame(data, columns)

# fill salary based on conditions over the NAME column
dataframe.withColumn("salary",
    when(dataframe.NAME == "sravan", lit("34000"))
    .when((dataframe.NAME == "ojsawi") |
          (dataframe.NAME == "bobby"), lit("31000"))
    .otherwise(lit("78000"))).show()
Output:
