How to sort by value in PySpark?
Last Updated: 04 Apr, 2025
In this article, we are going to sort by value in PySpark.
Creating an RDD for demonstration:
Python
from pyspark.sql import SparkSession, Row
# create a SparkSession and set an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# create 3 Rows with 3 columns
data = [Row(First_name="Sravan", Last_name="Kumar", age=23),
        Row(First_name="Ojaswi", Last_name="Pinkey", age=16),
        Row(First_name="Rohith", Last_name="Devi", age=7)]
# create an RDD from the rows
rdd = spark.sparkContext.parallelize(data)
rdd.collect()
Output:
[Row(First_name='Sravan', Last_name='Kumar', age=23),
Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Rohith', Last_name='Devi', age=7)]
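Each Row behaves like a tuple, so its columns can be read by position, which is what the lambda expressions below rely on. A quick sanity check, assuming the rdd created above:
Python
# Row objects support positional indexing: x[0] is First_name, x[2] is age
first = rdd.first()
print(first[0], first[2])  # Sravan 23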
Method 1: Using sortBy()
sortBy() is an RDD method used to sort the data by value in PySpark. It takes a key function (usually a lambda expression) that selects the value to sort on.
Syntax:
rdd.sortBy(lambda expression)
The lambda expression acts as the key function and picks the column whose values drive the sort.
lambda expression: lambda x: x[column_index]
Example 1: Sort the data by values based on column 1
Python
# sort the data by values based on column 1
rdd.sortBy(lambda x: x[0]).collect()
Output:
[Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Rohith', Last_name='Devi', age=7),
Row(First_name='Sravan', Last_name='Kumar', age=23)]
Example 2: Sort data based on column 2
Python
# sort the data by values based on column 2
rdd.sortBy(lambda x: x[1]).collect()
Output:
[Row(First_name='Rohith', Last_name='Devi', age=7),
Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Sravan', Last_name='Kumar', age=23)]
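sortBy() also accepts an optional ascending flag, which defaults to True. As a small sketch reusing the rdd built above, passing ascending=False sorts the rows by the age column in descending order:
Python
# sort the data by values in column 3 (age), largest first
rdd.sortBy(lambda x: x[2], ascending=False).collect()
With the sample data this should return the Sravan (23), Ojaswi (16) and Rohith (7) rows, in that order.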
Method 2: Using takeOrdered()
takeOrdered() is an RDD method that returns the first n elements sorted by the values in a particular column.
Syntax:
rdd.takeOrdered(n, lambda expression)
where n is the number of rows to return after sorting and the lambda expression selects the column to sort on.
Example: Sort values based on a particular column using the takeOrdered() function
Python
# sort values based on
# column 1 using takeOrdered function
print(rdd.takeOrdered(3, lambda x: x[0]))
# sort values based on
# column 3 using takeOrdered function
print(rdd.takeOrdered(3, lambda x: x[2]))
Output:
[Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Sravan', Last_name='Kumar', age=23)]
[Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Sravan', Last_name='Kumar', age=23)]
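Because takeOrdered() returns only the first n elements, negating a numeric key gives an easy "top n" query. A minimal sketch using the same rdd (the negated lambda is an illustration, not part of the original example):
Python
# take the 2 rows with the largest age by negating the numeric key
print(rdd.takeOrdered(2, lambda x: -x[2]))
With the sample data this should print the Sravan (23) and Ojaswi (16) rows.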