How to Iterate over rows and columns in PySpark dataframe
Last Updated: 22 Dec, 2022
In this article, we will discuss how to iterate rows and columns in PySpark dataframe.
Create the dataframe for demonstration:
Python3
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "company 1"],
["2", "ojaswi", "company 1"],
["3", "rohith", "company 2"],
["4", "sridevi", "company 1"],
["5", "bobby", "company 1"]]
# specify column names
columns = ['ID', 'NAME', 'Company']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
dataframe.show()
Output:

+---+-------+---------+
| ID|   NAME|  Company|
+---+-------+---------+
|  1| sravan|company 1|
|  2| ojaswi|company 1|
|  3| rohith|company 2|
|  4|sridevi|company 1|
|  5|  bobby|company 1|
+---+-------+---------+
Method 1: Using collect()
collect() brings every row of the dataframe back to the driver as a list of Row objects, which we can then traverse with an ordinary for loop. Because the entire dataframe is pulled into driver memory, this approach is best reserved for small datasets.
Syntax:
for iterator in dataframe.collect():
    print(iterator["column_name"], ...)
where,
- dataframe is the input dataframe
- iterator holds one Row object per loop pass
- column_name is the column whose value is read from each row
Example: Here we iterate over all rows returned by collect() and, inside the for loop, use i['column_name'] to read each column's value.
Python3
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "company 1"],
["2", "ojaswi", "company 1"],
["3", "rohith", "company 2"],
["4", "sridevi", "company 1"],
["5", "bobby", "company 1"]]
# specify column names
columns = ['ID', 'NAME', 'Company']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
# iterate over the collected rows
for i in dataframe.collect():
    # display each row's values
    print(i["ID"], i["NAME"], i["Company"])
Output:

1 sravan company 1
2 ojaswi company 1
3 rohith company 2
4 sridevi company 1
5 bobby company 1
Method 2: Using toLocalIterator()
toLocalIterator() returns an iterator over all rows of the dataframe. It is similar to collect(), but it is defined on the RDD, so the dataframe is first accessed through the rdd attribute:
dataframe.rdd.toLocalIterator()
Unlike collect(), it fetches one partition at a time, so the whole dataset never has to sit in driver memory at once. To iterate over all rows and columns, we loop over this iterator with a for loop.
Syntax:
for iterator in dataframe.rdd.toLocalIterator():
    print(iterator["column_name"], ...)
where,
- dataframe is the input dataframe
- iterator yields one Row object per loop pass
- column_name is the column whose value is read from each row
Example: Here we iterate over all rows with toLocalIterator() and, inside the for loop, use i['column_name'] to read each column's value.
Python3
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "company 1"],
["2", "ojaswi", "company 1"],
["3", "rohith", "company 2"],
["4", "sridevi", "company 1"],
["5", "bobby", "company 1"]]
# specify column names
columns = ['ID', 'NAME', 'Company']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
# using toLocalIterator()
for i in dataframe.rdd.toLocalIterator():
    # display each row's values
    print(i["ID"], i["NAME"], i["Company"])
Output:

1 sravan company 1
2 ojaswi company 1
3 rohith company 2
4 sridevi company 1
5 bobby company 1
Method 3: Using iterrows()
iterrows() is a pandas method, so we first convert the PySpark dataframe to a pandas dataframe with toPandas(). It then yields an (index, Series) pair for each row. Note that toPandas(), like collect(), brings all the data to the driver.
Syntax: dataframe.toPandas().iterrows()
Example: In this example, we iterate over the three-column rows using iterrows() in a for loop.
Python3
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "company 1"],
["2", "ojaswi", "company 1"],
["3", "rohith", "company 2"],
["4", "sridevi", "company 1"],
["5", "bobby", "company 1"]]
# specify column names
columns = ['ID', 'NAME', 'Company']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
# using iterrows()
for index, row in dataframe.toPandas().iterrows():
    # access values by column label
    print(row["ID"], row["NAME"], row["Company"])
Output:

1 sravan company 1
2 ojaswi company 1
3 rohith company 2
4 sridevi company 1
5 bobby company 1
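When the data is already in pandas, itertuples() is generally faster than iterrows() because it yields lightweight named tuples instead of Series objects. A sketch on a plain pandas dataframe built from the same sample data (standing in for dataframe.toPandas()):

```python
import pandas as pd

# same sample data as above, built directly in pandas
pdf = pd.DataFrame(
    [["1", "sravan", "company 1"],
     ["2", "ojaswi", "company 1"],
     ["3", "rohith", "company 2"]],
    columns=["ID", "NAME", "Company"])

# each element is a namedtuple; fields are accessed by column name
names = [t.NAME for t in pdf.itertuples(index=False)]
print(names)
```

Attribute access (t.NAME) only works when the column names are valid Python identifiers; otherwise positional access is needed.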
Method 4: Using select()
The select() method picks out the named columns, and collect() then returns the selected rows as a list of Row objects, which we loop over with a for loop.
Syntax: dataframe.select("column1", ..., "columnN").collect()
Example: Here we select the ID and NAME columns from the dataframe using the select() method.
Python3
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "company 1"],
["2", "ojaswi", "company 1"],
["3", "rohith", "company 2"],
["4", "sridevi", "company 1"],
["5", "bobby", "company 1"]]
# specify column names
columns = ['ID', 'NAME', 'Company']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
# select only the ID and NAME columns
for rows in dataframe.select("ID", "NAME").collect():
    # display
    print(rows[0], rows[1])
Output:

1 sravan
2 ojaswi
3 rohith
4 sridevi
5 bobby
Method 5: Using a list comprehension over rdd.collect()
We can also iterate over a single column. Here we collect the rows through the RDD with collect() and pull out one column's values with a list comprehension.
Syntax: dataframe.rdd.collect()
Example: Here we iterate over the values in the NAME column.
Python3
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "company 1"],
["2", "ojaswi", "company 1"],
["3", "rohith", "company 2"],
["4", "sridevi", "company 1"],
["5", "bobby", "company 1"]]
# specify column names
columns = ['ID', 'NAME', 'Company']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
# iterate over the NAME column only
for name in [row["NAME"] for row in dataframe.rdd.collect()]:
    print(name)
Output:
sravan
ojaswi
rohith
sridevi
bobby
Method 6: Using map()
In this method we use the map() function, which returns a new RDD by applying a function to each element of an existing RDD. Combined with a lambda, it lets us transform each row of the PySpark dataframe.
Because map() is defined only on RDDs, we first convert the dataframe to an RDD. The lambda passed to map() picks out the wanted fields from each row, and the resulting RDD can be converted back to a dataframe with toDF() by passing in the column names.
Syntax:
rdd = dataframe.rdd.map(lambda row: (row["column1"], ..., row["columnN"]))
rdd.toDF(["column1", ..., "columnN"]).collect()
Example: Here we are going to iterate ID and NAME column
Python3
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "company 1"],
["2", "ojaswi", "company 1"],
["3", "rohith", "company 2"],
["4", "sridevi", "company 1"],
["5", "bobby", "company 1"]]
# specify column names
columns = ['ID', 'NAME', 'Company']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
# select the ID and NAME fields using map()
rdd = dataframe.rdd.map(lambda row: (row["ID"], row["NAME"]))
# convert back to a dataframe and collect the rows
rdd.toDF(["ID", "NAME"]).collect()
Output:
[Row(ID='1', NAME='sravan'),
Row(ID='2', NAME='ojaswi'),
Row(ID='3', NAME='rohith'),
Row(ID='4', NAME='sridevi'),
Row(ID='5', NAME='bobby')]