How to add column sum as new column in PySpark dataframe ?
Last Updated: 25 Aug, 2021
In this article, we are going to see how to add a new column containing the row-wise sum of other columns to a PySpark dataframe, using various methods. In other words, we want to create a new column whose value in each row is the sum of the values already present in that row. Let's discuss the various methods for adding the sum as a new column.
But first, let's create a dataframe for demonstration.
Python3
# import SparkSession from pyspark
from pyspark.sql import SparkSession

# build and create the SparkSession
# with name "sum as new_col"
spark = SparkSession.builder.appName("sum as new_col").getOrCreate()

# creating the Spark DataFrame
data = spark.createDataFrame([('x', 5, 3, 7),
                              ('Y', 3, 3, 6),
                              ('Z', 5, 2, 6)],
                             ['A', 'B', 'C', 'D'])

# print the schema of the DataFrame
data.printSchema()

# showing the DataFrame
data.show()
Output:
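root
 |-- A: string (nullable = true)
 |-- B: long (nullable = true)
 |-- C: long (nullable = true)
 |-- D: long (nullable = true)

+---+---+---+---+
|  A|  B|  C|  D|
+---+---+---+---+
|  x|  5|  3|  7|
|  Y|  3|  3|  6|
|  Z|  5|  2|  6|
+---+---+---+---+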
Now we will look at the different methods for adding this new column to a Spark dataframe.
Method 1: Using UDF
In this method, we define a function that takes the column values as arguments and returns the total for the row. We do this with a UDF (user-defined function), which is Spark's mechanism for making reusable functions: it allows us to create new functions as per our requirements, which is why it is called a user-defined function.
We declare the return datatype of the UDF and create the function that returns the sum of all values in the row.
Python3
# import the functions as F from pyspark.sql
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

# define the sum_col function
def sum_col(b, c, d):
    col_sum = b + c + d
    return col_sum

# register the UDF with integer return type
new_f = F.udf(sum_col, IntegerType())

# calling the UDF and creating the new
# column as Udf_method_sum
df_col1 = data.withColumn("Udf_method_sum",
                          new_f("B", "C", "D"))

# showing and printing the schema of the dataframe
df_col1.printSchema()
df_col1.show()
Output:
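root
 |-- A: string (nullable = true)
 |-- B: long (nullable = true)
 |-- C: long (nullable = true)
 |-- D: long (nullable = true)
 |-- Udf_method_sum: integer (nullable = true)

+---+---+---+---+--------------+
|  A|  B|  C|  D|Udf_method_sum|
+---+---+---+---+--------------+
|  x|  5|  3|  7|            15|
|  Y|  3|  3|  6|            12|
|  Z|  5|  2|  6|            13|
+---+---+---+---+--------------+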

Method 2: Using expr()
expr(str) is a function in pyspark that takes a mathematical expression as an argument in the form of a string. For example, if you want the sum of rows, pass the argument as 'n1+n2+n3+...', where n1, n2, n3, ... are the column names.
Python3
# import expr from the functions
from pyspark.sql.functions import expr

# create the new column with withColumn,
# giving the column name 'expression_method_sum'
# and an expr() call that takes the
# expression argument as a string
df_col1 = df_col1.withColumn('expression_method_sum',
                             expr("B + C + D"))

# showing and printing the schema of
# the dataframe
df_col1.printSchema()
df_col1.show()
Output:
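root
 |-- A: string (nullable = true)
 |-- B: long (nullable = true)
 |-- C: long (nullable = true)
 |-- D: long (nullable = true)
 |-- Udf_method_sum: integer (nullable = true)
 |-- expression_method_sum: long (nullable = true)

+---+---+---+---+--------------+---------------------+
|  A|  B|  C|  D|Udf_method_sum|expression_method_sum|
+---+---+---+---+--------------+---------------------+
|  x|  5|  3|  7|            15|                   15|
|  Y|  3|  3|  6|            12|                   12|
|  Z|  5|  2|  6|            13|                   13|
+---+---+---+---+--------------+---------------------+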

Method 3: Using SQL operation
In this method, we first have to create a temporary view of the dataframe with the help of createTempView. This view lives only as long as the SparkSession. Note that createTempView returns None, so we should not reassign the dataframe variable to its result.
Then, after creating the view, we query it with a SQL SELECT clause, which takes the whole query as a string.
Python3
# creating the temporary view
# of the DataFrame as temp
# (createTempView returns None,
# so we do not assign its result)
df_col1.createTempView("temp")

# by using a SQL clause, creating
# the new column as sql_method
df_col1 = spark.sql('select *, B+C+D as sql_method from temp')

# printing the schema of the dataframe
# and showing the DataFrame
df_col1.printSchema()
df_col1.show()
Output:
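root
 |-- A: string (nullable = true)
 |-- B: long (nullable = true)
 |-- C: long (nullable = true)
 |-- D: long (nullable = true)
 |-- Udf_method_sum: integer (nullable = true)
 |-- expression_method_sum: long (nullable = true)
 |-- sql_method: long (nullable = true)

+---+---+---+---+--------------+---------------------+----------+
|  A|  B|  C|  D|Udf_method_sum|expression_method_sum|sql_method|
+---+---+---+---+--------------+---------------------+----------+
|  x|  5|  3|  7|            15|                   15|        15|
|  Y|  3|  3|  6|            12|                   12|        12|
|  Z|  5|  2|  6|            13|                   13|        13|
+---+---+---+---+--------------+---------------------+----------+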

Method 4: Using select()
Select the table with the select() method: the first argument is a column name, or "*" to select the whole table, and the next argument is the expression adding the columns; the alias() function gives a name to the newly created column.
Python3
# select everything from table df_col1 and
# create the new sum column as "select_method_sum"
df_col1 = df_col1.select('*',
                         (df_col1["B"] + df_col1["C"] + df_col1["D"])
                         .alias("select_method_sum"))

# showing the schema and table
df_col1.printSchema()
df_col1.show()
Output:
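root
 |-- A: string (nullable = true)
 |-- B: long (nullable = true)
 |-- C: long (nullable = true)
 |-- D: long (nullable = true)
 |-- Udf_method_sum: integer (nullable = true)
 |-- expression_method_sum: long (nullable = true)
 |-- sql_method: long (nullable = true)
 |-- select_method_sum: long (nullable = true)

+---+---+---+---+--------------+---------------------+----------+-----------------+
|  A|  B|  C|  D|Udf_method_sum|expression_method_sum|sql_method|select_method_sum|
+---+---+---+---+--------------+---------------------+----------+-----------------+
|  x|  5|  3|  7|            15|                   15|        15|               15|
|  Y|  3|  3|  6|            12|                   12|        12|               12|
|  Z|  5|  2|  6|            13|                   13|        13|               13|
+---+---+---+---+--------------+---------------------+----------+-----------------+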

Method 5: Using withColumn()
withColumn() is a transformation function of the dataframe that is used for changing values, changing datatypes, and creating new columns from existing ones.
This function takes the new column name and the expression summing the existing columns as arguments.
Python3
# by using the withColumn function;
# note we reference columns of df_col1 itself
df_col1 = df_col1.withColumn('withcolum_Sum',
                             df_col1['B'] + df_col1['C'] + df_col1['D'])

# showing and printing the schema
# of the dataframe
df_col1.printSchema()
df_col1.show()
Output:
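root
 |-- A: string (nullable = true)
 |-- B: long (nullable = true)
 |-- C: long (nullable = true)
 |-- D: long (nullable = true)
 |-- Udf_method_sum: integer (nullable = true)
 |-- expression_method_sum: long (nullable = true)
 |-- sql_method: long (nullable = true)
 |-- select_method_sum: long (nullable = true)
 |-- withcolum_Sum: long (nullable = true)

+---+---+---+---+--------------+---------------------+----------+-----------------+-------------+
|  A|  B|  C|  D|Udf_method_sum|expression_method_sum|sql_method|select_method_sum|withcolum_Sum|
+---+---+---+---+--------------+---------------------+----------+-----------------+-------------+
|  x|  5|  3|  7|            15|                   15|        15|               15|           15|
|  Y|  3|  3|  6|            12|                   12|        12|               12|           12|
|  Z|  5|  2|  6|            13|                   13|        13|               13|           13|
+---+---+---+---+--------------+---------------------+----------+-----------------+-------------+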
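As a final note, when there are many columns to add up, writing each name out by hand gets tedious. Below is a minimal sketch (not one of the article's original methods) that builds the row-wise sum programmatically with functools.reduce; the column list cols and the column name 'dynamic_sum' are assumptions for illustration.
Python3
# a minimal sketch: building the row-wise sum
# from a list of column names with functools.reduce;
# the list `cols` is an assumption for illustration
from functools import reduce
from pyspark.sql import functions as F

cols = ['B', 'C', 'D']  # columns to sum (assumed)
total = reduce(lambda a, b: a + b, [F.col(c) for c in cols])
df_col1 = df_col1.withColumn('dynamic_sum', total)
df_col1.show()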