Apply same function to all fields of PySpark dataframe row
Last Updated: 26 Apr, 2025
As a data scientist or data analyst who handles a lot of data, you may often need to apply the same transformation, such as converting text to uppercase or lowercase, or adding or subtracting a value, to every field of a data frame row. PySpark lets you do this in more than one way. In this article, we discuss the different ways to apply the same function to all fields of a PySpark data frame row.
Modules Required
PySpark: PySpark is the Python API for Apache Spark. It lets you use Spark from Python and provides data-manipulation features similar to those of the Pandas library, along with machine-learning utilities comparable to scikit-learn. This module can be installed with the following command:
pip install pyspark
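To confirm the installation, a quick import check like the one below can be run (a minimal sketch; the printed version depends on your environment):
import pyspark

# Print the installed PySpark version to confirm the module is importable
print(pyspark.__version__)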
Methods to apply the same function to all fields of a PySpark data frame row:
Method 1: Using reduce function
Syntax:
updated_data_frame = reduce(lambda traverse_df, col_name: traverse_df.withColumn(col_name, function_to_perform(col(col_name))), data_frame.columns, data_frame)
Here,
- function_to_perform: The function to apply to every field of the data frame rows, such as upper, lower, etc.
- data_frame: The data frame taken as input from the user.
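The full walkthrough below reads the data from a CSV file. As a quick, self-contained illustration of the same reduce pattern, the following minimal sketch builds a small in-memory data frame (the column names and values here are made up for demonstration) and upper-cases every field:
from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

spark_session = SparkSession.builder.getOrCreate()

# Hypothetical two-column data frame used only for illustration
data_frame = spark_session.createDataFrame([("alice", "delhi"), ("bob", "mumbai")],
                                           ["name", "city"])

# Fold over the column names, upper-casing one column per step
updated_data_frame = reduce(
    lambda traverse_df, col_name: traverse_df.withColumn(col_name, upper(col(col_name))),
    data_frame.columns,
    data_frame)
updated_data_frame.show()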
The examples below use the student_data.csv file as input.
Stepwise Implementation:
Step 1: First, import the required libraries, i.e. SparkSession, reduce, col, and upper. The SparkSession library is used to create the session, while reduce applies a given function cumulatively to the elements of a sequence. The col function is used to refer to a column by name, while upper converts text to uppercase. Instead of upper, you can use any other function that you want to apply to each row of the data frame.
from pyspark.sql import SparkSession
from functools import reduce
from pyspark.sql.functions import col, upper
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, read the CSV file into a data frame; you can call show() on it to check that it was loaded correctly.
data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
Step 4: Next, apply the chosen function to all the fields of the data frame rows using the reduce function.
updated_data_frame = (reduce(lambda traverse_df, col_name: traverse_df.withColumn(col_name, upper(col(col_name))), data_frame.columns, data_frame))
Step 5: Finally, display the data frame updated in the previous step.
updated_data_frame.show()
Example:
In this example, we use the reduce function to convert all the elements of the rows of the 5×5 data frame to uppercase through the upper function.
Python3
from pyspark.sql import SparkSession
from functools import reduce
from pyspark.sql.functions import col, upper

# Create the Spark session and read the CSV file
spark_session = SparkSession.builder.getOrCreate()
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Apply upper() to every column by folding over the column names
updated_data_frame = reduce(
    lambda traverse_df, col_name: traverse_df.withColumn(col_name, upper(col(col_name))),
    data_frame.columns,
    data_frame)
updated_data_frame.show()
Output:
Method 2: Using for loop
Syntax:
for col_name in data_frame.columns:
    data_frame = data_frame.withColumn(col_name, function_to_perform(col(col_name)))
Here,
- function_to_perform: The function to apply to every field of the data frame rows, such as upper, lower, etc.
- data_frame: The data frame taken as input from the user.
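As in Method 1, a minimal self-contained sketch (hypothetical column names and values again) shows the loop pattern; lower is used here instead of upper to illustrate that any column function can be plugged in:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower

spark_session = SparkSession.builder.getOrCreate()

# Hypothetical data frame used only for illustration
data_frame = spark_session.createDataFrame([("ALICE", "DELHI"), ("BOB", "MUMBAI")],
                                           ["name", "city"])

# Rebind data_frame on every iteration, replacing one column at a time
for col_name in data_frame.columns:
    data_frame = data_frame.withColumn(col_name, lower(col(col_name)))
data_frame.show()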
Stepwise Implementation:
Step 1: First, import the required libraries, i.e. SparkSession, col, and upper. The SparkSession library is used to create the session. The col function is used to refer to a column by name, while upper converts text to uppercase. Instead of upper, you can use any other function that you want to apply to each row of the data frame.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, read the CSV file into a data frame; you can call show() on it to check that it was loaded correctly.
data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
Step 4: Next, use a for loop to traverse all the columns and convert each field to uppercase.
for col_name in data_frame.columns:
    data_frame = data_frame.withColumn(col_name, upper(col(col_name)))
Step 5: Finally, display the data frame updated in the previous step.
data_frame.show()
Example:
In this example, we use a for loop to convert all the elements of the rows of the 5×5 data frame to uppercase through the upper function.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

# Create the Spark session and read the CSV file
spark_session = SparkSession.builder.getOrCreate()
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Replace each column with its upper-cased version, one column per iteration
for col_name in data_frame.columns:
    data_frame = data_frame.withColumn(col_name, upper(col(col_name)))
data_frame.show()
Output:
Method 3: Using list comprehension
Syntax:
updated_data_frame = data_frame.select(*[function_to_perform(col(col_name)).name(col_name) for col_name in data_frame.columns])
Here,
- function_to_perform: The function to apply to every field of the data frame rows, such as upper, lower, etc.
- data_frame: The data frame taken as input from the user.
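A minimal self-contained sketch of this approach (hypothetical data once more) is shown below; Column.name is used to keep the original column names, and Column.alias would work the same way:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

spark_session = SparkSession.builder.getOrCreate()

# Hypothetical data frame used only for illustration
data_frame = spark_session.createDataFrame([("alice", "delhi"), ("bob", "mumbai")],
                                           ["name", "city"])

# Build one upper-cased expression per column and select them all at once
updated_data_frame = data_frame.select(
    *[upper(col(col_name)).name(col_name) for col_name in data_frame.columns])
updated_data_frame.show()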
Stepwise Implementation:
Step 1: First, import the required libraries, i.e. SparkSession, col, and upper. The SparkSession library is used to create the session. The col function is used to refer to a column by name, while upper converts text to uppercase. Instead of upper, you can use any other function that you want to apply to each row of the data frame.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, read the CSV file into a data frame; you can call show() on it to check that it was loaded correctly.
data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
Step 4: Next, use a list comprehension to traverse all the columns and convert each field to uppercase.
updated_data_frame = data_frame.select(*[upper(col(col_name)).name(col_name) for col_name in data_frame.columns])
Step 5: Finally, display the data frame updated in the previous step.
updated_data_frame.show()
Example:
In this example, we use a list comprehension to convert all the elements of the rows of the 5×5 data frame to uppercase through the upper function.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

# Create the Spark session and read the CSV file
spark_session = SparkSession.builder.getOrCreate()
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Build an upper-cased expression for every column and select them all at once
updated_data_frame = data_frame.select(
    *[upper(col(col_name)).name(col_name) for col_name in data_frame.columns])
updated_data_frame.show()
Output:
