Pass multiple columns in UDF in Pyspark
Last Updated: 28 Apr, 2025
In this article, we are going to learn how to pass multiple columns to a UDF in Pyspark using Python.
Pyspark offers many kinds of functions, such as string functions, sort functions, and window functions, but it also provides one of the most essential kinds: the User Defined Function (UDF). A UDF is a crucial feature of Spark SQL and data frames that lets you extend Pyspark's built-in capabilities with your own Python logic. A UDF can accept not only one column but several, and this article shows three ways to do that.
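As a quick refresher, a UDF wraps an ordinary Python function so Spark can apply it to a column. Below is a minimal single-column sketch; the Name column and its data are illustrative only and are not used later in the article. The rest of the article extends this idea to multiple columns.
Python3
# Minimal single-column UDF sketch (illustrative data, not used later)
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

spark_session = SparkSession.builder.getOrCreate()

# Wrap a plain Python function so it can run on a column
upper_case = udf(lambda name: name.upper(), StringType())

data_frame = spark_session.createDataFrame(
    [(1, 'alex'), (2, 'maria')], ['Roll_Number', 'Name'])
data_frame.withColumn('Name_Upper', upper_case('Name')).show()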
Methods to pass multiple columns in UDF:
- Simple Approach
- Approach using struct
- Approach using array
Method 1: Simple Approach
In this method, we create a data frame with three columns, Roll_Number, Fees, and Fine, and then add a new column called "Total Amount" with udf(). The two columns Fees and Fine are passed directly to the UDF, which returns their sum, and withColumn() attaches the result to the data frame as the "Total Amount" column.
Implementation:
Step 1: First of all, import the libraries SparkSession, IntegerType, and udf. The SparkSession library is used to create the session, IntegerType declares the return type of the UDF, and udf() turns an ordinary Python function into a reusable column function.
Step 2: Now, create a spark session using the getOrCreate() function and define the Python function to be applied to the columns of the data frame.
Step 3: Wrap that function with udf(), passing IntegerType as the return type, so it can receive multiple columns.
Step 4: Create the data frame and, using withColumn(), call the UDF on the Fees and Fine columns to display the data frame with the new column.
Python3
# Pyspark program to pass multiple
# columns in UDF: Simple Approach

# Import the libraries SparkSession, IntegerType and udf
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Create a function to add two numbers
def sum(x, y):
    return x + y

# Pass multiple columns in UDF
sum_cols = udf(sum, IntegerType())

# Create a data frame with three columns 'Roll_Number', 'Fees' and 'Fine'
data_frame = spark_session.createDataFrame(
    [(1, 10000, 400), (2, 14000, 500), (3, 12000, 800)],
    ['Roll_Number', 'Fees', 'Fine'])

# Display the data frame showing the new column formed
# by calling the sum function on columns 'Fees' and 'Fine'
data_frame.withColumn('Total Amount',
                      sum_cols('Fees', 'Fine')).show()
Output:
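With the sample data above, Total Amount holds the row-wise sum of Fees and Fine, so show() should print a table like this:
+-----------+-----+----+------------+
|Roll_Number| Fees|Fine|Total Amount|
+-----------+-----+----+------------+
|          1|10000| 400|       10400|
|          2|14000| 500|       14500|
|          3|12000| 800|       12800|
+-----------+-----+----+------------+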
Method 2: Approach using struct
This method produces the same result as the one above, but here the columns are bundled into a single struct column with struct(), and the UDF receives that struct as one argument.
Implementation:
Step 1: First of all, import the libraries SparkSession, IntegerType, udf, and struct. The SparkSession library is used to create the session, IntegerType declares the return type of the UDF, udf() turns an ordinary Python function into a reusable column function, and struct() combines several columns into a single struct column.
Step 2: Create a spark session using the getOrCreate() function and define the UDF, passing a lambda that adds the struct's fields along with IntegerType as the return type.
Step 3: Create the data frame and call the UDF on a struct of the Fees and Fine columns to display the data frame with the new column.
Python3
# Pyspark program to pass multiple
# columns in UDF: Approach using struct

# Import the libraries SparkSession,
# IntegerType, struct and udf
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf, struct

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Pass multiple columns in UDF by adding the
# first two fields of the incoming struct
sum_cols = udf(lambda x: x[0] + x[1], IntegerType())

# Create a data frame with three columns 'Roll_Number', 'Fees' and 'Fine'
data_frame = spark_session.createDataFrame(
    [(1, 10000, 400), (2, 14000, 500), (3, 12000, 800)],
    ['Roll_Number', 'Fees', 'Fine'])

# Display the data frame showing the new column formed by calling
# the UDF on a struct of the columns 'Fees' and 'Fine'
data_frame.withColumn('Total Amount',
                      sum_cols(struct('Fees', 'Fine'))).show()
Output:
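The result is the same table as in Method 1. Since the struct is delivered to the Python function as a Row, its fields can also be read by name instead of by position, which makes the lambda easier to follow. A small variation on the program above (same columns, only the lambda changes):
Python3
# Variation: read the struct's fields by name instead of by position
sum_cols = udf(lambda row: row['Fees'] + row['Fine'], IntegerType())
data_frame.withColumn('Total Amount',
                      sum_cols(struct('Fees', 'Fine'))).show()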
Method 3: Approach using an array
The final output is the same as above, but here the columns are collected into a single array column with array(), and the UDF applies Python's built-in sum() to the elements of that array.
Implementation:
Step 1: First of all, import the libraries SparkSession, IntegerType, udf, and array. The SparkSession library is used to create the session, IntegerType declares the return type of the UDF, udf() turns an ordinary Python function into a reusable column function, and array() combines several columns into a single array column.
Step 2: Create a spark session using the getOrCreate() function and define the UDF, passing a lambda that applies Python's built-in sum() to the array along with IntegerType as the return type.
Step 3: Create the data frame and call the UDF on an array of the Fees and Fine columns to display the data frame with the new column.
Python3
# Pyspark program to pass multiple
# columns in UDF: Approach using array

# Import the libraries SparkSession, IntegerType, array and udf
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf, array

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Pass multiple columns in UDF by calling the
# built-in sum function on the incoming array
sum_cols = udf(lambda arr1: sum(arr1), IntegerType())

# Create a data frame with three columns 'Roll_Number', 'Fees' and 'Fine'
# and display it with the new column formed by calling
# the UDF on an array of the columns 'Fees' and 'Fine'
spark_session.createDataFrame(
    [(1, 10000, 400), (2, 14000, 500), (3, 12000, 800)],
    ['Roll_Number', 'Fees', 'Fine']).withColumn(
        'Total Amount', sum_cols(array('Fees', 'Fine'))).show()
Output:
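Again, the result matches Method 1. A benefit of the array approach is that it scales to any number of columns, because the array can be built from a list of column names (the listed columns should share a compatible data type). The sketch below reuses the sum_cols UDF defined above; the Library_Fee column and the cols_to_sum list are hypothetical additions used only for illustration, not part of the article's data.
Python3
# Illustrative sketch: summing an arbitrary list of columns.
# 'Library_Fee' and 'cols_to_sum' are hypothetical, added for illustration.
cols_to_sum = ['Fees', 'Fine', 'Library_Fee']
extended_frame = spark_session.createDataFrame(
    [(1, 10000, 400, 100), (2, 14000, 500, 150), (3, 12000, 800, 120)],
    ['Roll_Number', 'Fees', 'Fine', 'Library_Fee'])
extended_frame.withColumn('Total Amount',
                          sum_cols(array(*cols_to_sum))).show()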