
How to sort by value in PySpark?

Last Updated : 04 Apr, 2025

In this article, we are going to sort the rows of a PySpark RDD by the value of a chosen column, using the sortBy() and takeOrdered() methods.

Creating RDD for demonstration:

Python
from pyspark.sql import SparkSession, Row

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# create 3 Rows with 3 columns each, collected in a list
# (without the brackets, only the first Row would be assigned to data)
data = [Row(First_name="Sravan", Last_name="Kumar", age=23),
        Row(First_name="Ojaswi", Last_name="Pinkey", age=16),
        Row(First_name="Rohith", Last_name="Devi", age=7)]

# distribute the list of Rows as an RDD
rdd = spark.sparkContext.parallelize(data)

rdd.collect()

Output:

[Row(First_name='Sravan', Last_name='Kumar', age=23),
Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Rohith', Last_name='Devi', age=7)]

Method 1: Using sortBy()

sortBy() is an RDD method that returns a new RDD whose elements are sorted by the value that a key function computes for each element. The sort is ascending by default.

Syntax:

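rdd.sortBy(keyfunc, ascending=True, numPartitions=None)

Here keyfunc is usually a lambda expression that returns, for each row, the column value to sort by. Pass ascending=False to sort in descending order.

lambda expression: lambda x: x[column_index], where column_index is zero-based (0 is the first column)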

Example 1: Sort the data by the first column (First_name, index 0)

Python
# sort the rows by the first column, First_name (index 0)
rdd.sortBy(lambda x: x[0]).collect()

Output:

[Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Rohith', Last_name='Devi', age=7),
Row(First_name='Sravan', Last_name='Kumar', age=23)]

Example 2: Sort the data by the second column (Last_name, index 1)

Python
# sort the rows by the second column, Last_name (index 1)
rdd.sortBy(lambda x: x[1]).collect()

Output:

[Row(First_name='Rohith', Last_name='Devi', age=7),
Row(First_name='Sravan', Last_name='Kumar', age=23),
Row(First_name='Ojaswi', Last_name='Pinkey', age=16)]

Method 2: Using takeOrdered()

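takeOrdered() is an RDD method that returns the first n elements in ascending order of the value computed by a key function. Unlike sortBy(), it returns a Python list to the driver rather than an RDD, so it is best suited to cases where n is small.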

Syntax:

rdd.takeOrdered(n, lambda expression)

where n is the number of rows to return after sorting, and the lambda expression selects the column value to sort by

Sort values based on a particular column using takeOrdered function

Python
# sort values based on
# column 1 using takeOrdered function
print(rdd.takeOrdered(3,lambda x: x[0]))

# sort values based on
# column 3 using takeOrdered function
print(rdd.takeOrdered(3,lambda x: x[2]))

Output:

[Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Sravan', Last_name='Kumar', age=23)]

[Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Sravan', Last_name='Kumar', age=23)]

