How to Write Spark UDFs (User Defined Functions) in Python?
Last Updated : 06 Jun, 2021
In this article, we will talk about UDFs (User Defined Functions) and how to write them in Python Spark. UDF stands for User Defined Function. A UDF lets us apply an ordinary Python function directly to DataFrame columns (and, once registered by name, to Spark SQL queries as well), without having to rewrite the logic for each use. It also lets us create new columns in our DataFrame by applying a function to existing column(s), thereby extending the functionality of the DataFrame API. A UDF is created using the udf() method.
udf(): This method wraps a Python function (often a lambda) and returns a UDF; the wrapped function is applied to each value of the column(s) it is given, and its return value becomes the value in the resulting column.
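As a quick orientation before the example, here is the general call shape of udf() as a minimal sketch (the function name "shout" and its logic are purely illustrative, not part of the example that follows): it wraps an ordinary Python callable, usually a lambda, and accepts an optional return type, which defaults to StringType().
Python3
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# udf() wraps a plain Python callable; the optional second argument
# declares the return type (StringType() is the default).
shout = udf(lambda s: s.upper() if s is not None else None, StringType())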
Sample Pyspark Dataframe
Let's create a DataFrame whose theme is the name of a student along with his/her raw score in a test out of 100.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType
from pyspark.sql.functions import udf

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName('UDF PRACTICE').getOrCreate()

# Column names and sample rows (raw scores are stored as strings)
cms = ["Name", "RawScore"]
data = [("Jack", "79"),
        ("Mira", "80"),
        ("Carter", "90")]

df = spark.createDataFrame(data=data, schema=cms)
df.show()
Output:
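Since the rendered output image is not reproduced here, df.show() prints roughly the following table:
+------+--------+
|  Name|RawScore|
+------+--------+
|  Jack|      79|
|  Mira|      80|
|Carter|      90|
+------+--------+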

Creating Sample Function
Now, we have to make a function. So, for understanding, we will make a simple function that splits the string into words and checks each word: if it starts with 'J' (capital J), 'C' (capital C) or 'M' (capital M), the uppercase version of that word's second letter is appended to the result. The implementation of this code is:
Python3
def Converter(s):
    result = ""
    words = s.split(" ")
    for q in words:
        # words starting with J, C or M contribute their second letter, uppercased
        if q[0] in ('J', 'C', 'M'):
            result += q[1:2].upper()
    return result
Making UDF from Sample function
Now, we will convert this function into a UDF, which will, in turn, let Spark apply it to DataFrame columns for us. For this, we are using a lambda inside udf().
Python3
NumberUDF = udf(lambda m: Converter(m))
Using UDF over Dataframe
The next thing we will use here is withColumn(); remember that withColumn() returns a new DataFrame that includes the added column. Here we simply call show() on the result; to keep the new column, we would assign the returned DataFrame back to df.
Python3
df.withColumn( "Special Names" , NumberUDF( "Name" )).show()
Output:

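The introduction mentioned that UDFs can also be used from Spark SQL; for that, the function must first be registered under a name. A minimal sketch reusing the Converter function and df from above (the function name "convert_name" and the view name "students" are illustrative, not part of the original example):
Python3
from pyspark.sql.types import StringType

# Register Converter under a name that SQL queries can call
spark.udf.register("convert_name", Converter, StringType())

# Expose the DataFrame as a temporary view and call the UDF from SQL
df.createOrReplaceTempView("students")
spark.sql("SELECT Name, convert_name(Name) AS SpecialName FROM students").show()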
Note: We can also do all of this in a single step.
UDF with annotations
Now, a shorter and smarter way of doing this is to use the @udf decorator (sometimes loosely called an "annotation"). This creates our UDF in fewer steps: all we have to do is put @udf in front of the function definition and pass the return type as its argument, i.e. set returnType to IntegerType(), StringType(), etc.
Python3
@udf(returnType=StringType())
def Converter(s):
    result = ""
    words = s.split(" ")
    for q in words:
        # words starting with J, C or M contribute their second letter, uppercased
        if q[0] in ('J', 'C', 'M'):
            result += q[1:2].upper()
        else:
            result += q
    return result
df.withColumn( "Special Names" , Converter( "Name" )) \
.show()
Output:

Example:
Now, let's suppose there is a marking scheme in the school that calibrates the marks of students as the square root of the raw score plus 3 (so a raw score of 100 maps to √100 + 3 = 13). So, we will define a UDF function, and this time we will specify its return type, i.e. the float data type. The declaration of this function will be:
Python3
import math

def SQRT(x):
    # RawScore is stored as a string, so cast before taking the square root
    return float(math.sqrt(float(x)) + 3)
Now, we will define a udf whose return type will always be float, i.e. we are forcing the function, as well as the UDF, to give us results as floating-point numbers only. The definition of this function will be:
Python3
from pyspark.sql.types import FloatType

UDF_marks = udf(lambda m: SQRT(m), FloatType())
The second parameter of udf(), FloatType(), forces the UDF to return its result as a floating-point value. Now, we will use our UDF, UDF_marks, on the RawScore column of our DataFrame, and it will produce a new column whose name is generated by default from the lambda (something like "<lambda>(RawScore)"). The code for this will look like:
Python3
df.select( "Name" , "RawScore" , UDF_marks( "RawScore" )).show()
Output:

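If the auto-generated column name is not wanted, the Column returned by the UDF call can be renamed with .alias(); a short sketch using the same UDF_marks (the alias "CalibratedScore" is just an illustrative name):
Python3
# Rename the UDF output column instead of keeping the generated name
df.select("Name", "RawScore",
          UDF_marks("RawScore").alias("CalibratedScore")).show()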