Calling another custom Python function from Pyspark UDF
Last Updated: 18 Mar, 2024
PySpark, often described as the Python API for Apache Spark, was created for distributed data processing. It lets users perform complex computations and transformations on large datasets efficiently and at scale. One of PySpark's main features is User-Defined Functions (UDFs), which let users write their own functions and apply them to Spark DataFrames or RDDs. Using UDFs, PySpark's capabilities can be extended and customized to meet specific needs. In this article, we will learn how to call another custom Python function from a PySpark UDF.
Calling Another Custom Python Function from Pyspark UDF
PySpark UDFs are written in Python, so they can call other Python functions, whether built-ins, functions from external libraries, or other user-defined functions. This improves the modularity and reusability of UDFs by letting users draw on existing Python code. Within the distributed PySpark environment, users can implement their domain-specific logic, carry out complex calculations, or apply specialized algorithms, taking advantage of Python's vast ecosystem of libraries directly from their UDFs.
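To make the idea concrete, here is a minimal sketch of a UDF whose body simply delegates to another plain Python function. The names 'clean_text' and 'normalize' are only illustrative and do not appear in the examples that follow.
from pyspark.sql.functions import udf

def clean_text(value):
    # Ordinary Python helper, reusable outside Spark as well
    return value.strip().lower()

def normalize(value):
    # The function wrapped as a UDF calls the helper above
    return clean_text(value) if value is not None else None

# normalize_udf can now be used in DataFrame expressions such as withColumn()
normalize_udf = udf(normalize)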
Steps to Call another Custom Python Function from a PySpark UDF
Let us see a step-by-step process to call another custom Python function from a Pyspark UDF.
Step 1: Import the necessary modules
First, import 'udf' from the 'pyspark.sql.functions' module, which provides functions for working with Spark DataFrame columns.
from pyspark.sql.functions import udf
Step 2: Start Spark Session
Next, import 'SparkSession' from 'pyspark.sql' and create a Spark session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
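Optionally, when experimenting on a local machine, the builder can also set an application name and master; this configuration is only illustrative and is not required for the steps below.
spark = (
    SparkSession.builder
    .appName("pyspark-udf-example")   # hypothetical application name
    .master("local[*]")               # run locally using all available cores
    .getOrCreate()
)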
Step 3: Create a Dataframe
The next step is to create a DataFrame on which the operations will be performed in Spark.
data = [("Marry", 25), ("Sunny", 30), ("Ram", 35)]
df = spark.createDataFrame(data, ["name", "age"])
Step 4: Define the custom Python function
Then define the custom Python function that we wish to invoke from the PySpark UDF. Any logic or calculation can go in this function, for example, a function that converts a string to uppercase.
def to_uppercase(string):
    return string.upper()
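If the column may contain null values, a defensive variant avoids a runtime error, because Spark passes None to the UDF for NULL cells. This guard is an addition that assumes nullable data; the original example works as-is for non-null strings.
def to_uppercase(string):
    # Return None for NULL cells instead of calling .upper() on None
    return string.upper() if string is not None else None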
Step 5: Create a PySpark UDF
After creating the custom Python function, use the 'udf()' function from the 'pyspark.sql.functions' module to construct a PySpark UDF. Pass the custom Python function to 'udf()' as an argument; this registers it as a UDF so that it can be applied to DataFrame columns.
to_uppercase_udf = udf(to_uppercase)
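By default, 'udf()' treats the return value as a string ('StringType'). The return type can also be passed explicitly, which matters when the function returns something other than a string; for this example the explicit type is optional.
from pyspark.sql.types import StringType

to_uppercase_udf = udf(to_uppercase, StringType())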
Step 6: Apply the UDF to a DataFrame
After creating the PySpark UDF, use the 'withColumn()' method to apply it to a DataFrame column. This method returns a DataFrame with a new column added (or an existing column replaced if the name already exists). The UDF is called once for each row, applying the custom Python function to the designated column and producing the desired result.
df = df.withColumn("name_uppercase", to_uppercase_udf(df["name"]))
Step 7: Display the DataFrame
Finally, use the 'show()' method to display the DataFrame and see the changes made to it.
df.show()
By following these steps, we can perform customized calculations and transformations on PySpark DataFrames by calling another custom Python function from a PySpark UDF.
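As a compact alternative to the step-by-step registration above, 'udf' can also be used as a decorator. The sketch below is equivalent in effect and keeps the helper-delegation pattern this article is about: the decorated UDF simply calls the plain Python helper.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def to_uppercase(string):
    # Plain Python helper that remains reusable outside Spark
    return string.upper() if string is not None else None

@udf(returnType=StringType())
def to_uppercase_udf(string):
    # The UDF simply delegates to the helper above
    return to_uppercase(string)

df = df.withColumn("name_uppercase", to_uppercase_udf(df["name"]))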
Example to Call another Custom Python Function from a PySpark UDF
Now, let us see a few different examples of calling custom Python functions from a PySpark UDF.
Example 1: Converting a DataFrame column to uppercase
In this example, we create a Spark DataFrame 'df' containing people's names and ages. We then define a custom Python function 'to_uppercase()' that takes a string as an argument and converts it to uppercase, create a PySpark UDF from it using PySpark's 'udf()' function, and apply the UDF to store the result in a new column of the DataFrame.
Python3
# import modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
# Defining the custom Python function here
def to_uppercase(string):
    return string.upper()
# Now we will create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Now create a DataFrame
data = [("Marry", 25), ("Sunny", 30), ("Ram", 35)]
df = spark.createDataFrame(data, ["name", "age"])
# Make a PySpark UDF now
to_uppercase_udf = udf(to_uppercase)
# Now Apply the UDF to the 'name' column.
df = df.withColumn("name_uppercase", to_uppercase_udf(df["name"]))
# Function to show the DataFrame
df.show()
Output:
Converting values to uppercase
Example 2: Calling a custom Python function that combines multiple DataFrame columns
In this example, we create a DataFrame with two columns, 'first_name' and 'last_name'. We then define a custom Python function 'combine_columns()' that takes the first name and last name as parameters and joins them, and we apply it through a UDF to produce a new 'full_name' column.
Python3
# import modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
# Defining the custom Python function
def combine_columns(col1, col2):
    return col1 + " " + col2
# Now create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Now create a DataFrame
data = [("John", "Doe"), ("Ram", "Kumar"), ("Smith", "Jones")]
df = spark.createDataFrame(data, ["first_name", "last_name"])
# Make a PySpark UDF
combine_columns_udf = udf(combine_columns)
# Apply the UDF to the 'first_name' and 'last_name' columns
df = df.withColumn("full_name", combine_columns_udf(df["first_name"], df["last_name"]))
# Function to show the DataFrame
df.show()
Output:
Combining first and last names
Example 3: Calling a Custom Python Function from PySpark UDF with External Libraries
For more complex calculations, PySpark lets us use external Python libraries inside custom functions. Suppose we want to use the fuzzy matching library 'fuzzywuzzy' in a custom Python function named 'calculate_similarity()' to measure the similarity between two strings.
In this example, we import the 'fuzz' module from the fuzzywuzzy library and use the 'fuzz.ratio()' function to determine the degree of similarity between two strings. The custom Python function 'calculate_similarity()' calls 'fuzz.ratio()' on the input strings. Using the 'udf()' function, we build a UDF named 'similarity_udf' and specify its return type. Finally, we apply the UDF to the 'string1' and 'string2' columns with the 'withColumn()' method and display the resulting DataFrame with the similarity ratios.
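Note that fuzzywuzzy must be installed on the driver and on every executor (for example with 'pip install fuzzywuzzy') for the UDF to run. If adding the dependency is not an option, a similar 0-100 ratio can be computed with the standard library's 'difflib'; this substitution is only a sketch and not part of the original example.
from difflib import SequenceMatcher

def calculate_similarity(str1, str2):
    # SequenceMatcher.ratio() returns a float in [0, 1];
    # scale it to 0-100 to mimic fuzz.ratio()
    return int(SequenceMatcher(None, str1, str2).ratio() * 100)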
Python3
# import modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
from fuzzywuzzy import fuzz
# Creating a SparkSession
spark = SparkSession.builder.getOrCreate()
# Sample DataFrame taken with columns 'string1' and 'string2'
data = [("apple", "apples"), ("banana", "bananas"), ("cat", "dog")]
df = spark.createDataFrame(data, ["string1", "string2"])
# Creating a Custom Python function
def calculate_similarity(str1, str2):
    return fuzz.ratio(str1, str2)
# Creating a UDF from the custom function
similarity_udf = udf(calculate_similarity, IntegerType())
# Apply the UDF to calculate similarity
df.withColumn("similarity", similarity_udf(df["string1"], df["string2"])).show()
Output:
Finding the similarity between two texts
Example 4: Applying a Custom Python Function with Complex Logic
Let's look at an example where we have a DataFrame with a column of strings representing sentences, and we want to use a custom Python function called 'count_words()' to determine how many words each sentence contains.
Here, the custom Python function 'count_words()' uses the 'split()' method to break the input sentence into words and the 'len()' function to count them. Using the 'udf()' function, we build a UDF named 'count_udf' and specify its return type. Finally, we apply the UDF to the 'sentence' column with the 'withColumn()' method and display the resulting DataFrame with the word counts.
Python3
# import modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
# Creating a SparkSession
spark = SparkSession.builder.getOrCreate()
# Sample DataFrame with a column 'sentence'
data = [("Hello, PySpark!",), ("PySpark is great in today's world",),
("Spark DataFrames are powerful in python to work on",)]
df = spark.createDataFrame(data, ["sentence"])
# Creating a Custom Python function
def count_words(sentence):
    return len(sentence.split())
# Creating a UDF from the custom function
count_udf = udf(count_words, IntegerType())
# Apply the UDF to count words in each sentence
df.withColumn("word_count", count_udf(df["sentence"])).show()
Output:
Counting the words present in a sentence