How to sort by value in PySpark?
In this article, we are going to see how to sort by value in PySpark.
Creating an RDD for demonstration:
Python
from pyspark.sql import SparkSession, Row
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# create 3 Rows with 3 columns
data = [Row(First_name="Sravan", Last_name="Kumar", age=23),
        Row(First_name="Ojaswi", Last_name="Pinkey", age=16),
        Row(First_name="Rohith", Last_name="Devi", age=7)]
# create an RDD from the rows
rdd = spark.sparkContext.parallelize(data)
rdd.collect()
Output:
[Row(First_name='Sravan', Last_name='Kumar', age=23),
Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Rohith', Last_name='Devi', age=7)]
Method 1: Using sortBy()
sortBy() is an RDD method that sorts the data by value. It takes a key function (usually a lambda expression) and returns a new RDD sorted by that key; an optional ascending flag (True by default) controls the sort direction.
Syntax:
rdd.sortBy(lambda expression)
The lambda expression selects the column to sort on:
lambda expression: lambda x: x[column_index]
Example 1: Sort the data by values based on column 1
Python
# sort the data by values based on column 1
rdd.sortBy(lambda x: x[0]).collect()
Output:
[Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Rohith', Last_name='Devi', age=7),
Row(First_name='Sravan', Last_name='Kumar', age=23)]
Example 2: Sort data based on column 2
Python
# sort the data by values based on column 2
rdd.sortBy(lambda x: x[1]).collect()
Output:
[Row(First_name='Rohith', Last_name='Devi', age=7),
Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Sravan', Last_name='Kumar', age=23)]
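Example 3: Since sortBy() accepts an ascending flag, the same rdd can also be sorted in descending order; a quick sketch sorting by the age column (index 2), from oldest to youngest:
Python
# sort the data in descending order
# by the age column (index 2)
rdd.sortBy(lambda x: x[2], ascending=False).collect()
Output:
[Row(First_name='Sravan', Last_name='Kumar', age=23),
Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Rohith', Last_name='Devi', age=7)]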
Method 2: Using takeOrdered()
takeOrdered() is an RDD method that returns the first n elements of the RDD, sorted by a key function, so it can be used to sort values based on a particular column.
Syntax:
rdd.takeOrdered(n, lambda expression)
where n is the number of rows to return after sorting.
Example: Sort values based on a particular column using the takeOrdered() function
Python
# sort values based on
# column 1 using takeOrdered function
print(rdd.takeOrdered(3, lambda x: x[0]))
# sort values based on
# column 3 using takeOrdered function
print(rdd.takeOrdered(3, lambda x: x[2]))
Output:
[Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Sravan', Last_name='Kumar', age=23)]
[Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Sravan', Last_name='Kumar', age=23)]
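takeOrdered() has no ascending flag, but for a numeric column the usual idiom is to negate the key; a minimal sketch sorting by age in descending order:
Python
# sort values in descending order by
# column 3 (age) by negating the key
print(rdd.takeOrdered(3, lambda x: -x[2]))
Output:
[Row(First_name='Sravan', Last_name='Kumar', age=23), Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Rohith', Last_name='Devi', age=7)]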