Sort PySpark DataFrame Within Groups
Last Updated: 26 Apr, 2025
If you work with data in Python, and with PySpark DataFrames in particular, you are probably familiar with the many functions you can apply to a dataset. But did you know that, after grouping rows on one or more columns, you can also arrange the grouped results in ascending or descending order? This article explains how to sort a PySpark data frame within groups.
Modules Required:
PySpark: PySpark is the Python API for Apache Spark, an open-source, distributed computing framework and set of libraries for large-scale data processing. It can be installed with the following command:
pip install pyspark
Methods to sort a PySpark data frame within groups
- Using the sort() function
- Using the orderBy() function
Method 1: Using sort() function
In this method, we are going to use the sort() function to sort the data frame in PySpark. It accepts one or more columns (or column expressions such as asc()/desc()) to sort by, along with an optional Boolean ascending argument that controls whether the sort is ascending or descending.
Syntax:
DataFrame.sort(*cols, ascending=True)
Parameters:
- cols: list of Column objects or column names to sort by
- ascending: Boolean, or a list of Booleans (one per column), that controls the sort direction; defaults to True (ascending)
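A minimal sketch of both ways to control the sort direction, assuming a hypothetical DataFrame df with columns class and marks:
from pyspark.sql.functions import desc

# sort by a column name in descending order using the ascending keyword
df.sort("class", ascending=False).show()

# equivalent: sort using a descending column expression
df.sort(desc("class")).show()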
Stepwise Implementation:
Step 1: First, import the required objects, i.e. SparkSession, sum, and desc. SparkSession is used to create the session, sum is used to aggregate the column on which groupby is applied, and desc is used to sort a column in descending order.
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, desc
Step 2: Now, create a Spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, read the CSV file which you want to sort within groups.
data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
Step 4: Finally, group the data frame using the groupby function and then arrange the grouped result in ascending or descending order using the sort function. The agg function is used to aggregate the column that is summarized within each group.
data_frame.groupby("#column-name").agg(sum("#column-name").alias("#column-name")).sort(desc("#column-name")).show()
Example 1:
In this example, we take the data frame (link), i.e., the 3×6 dataset, group it by the class and name columns using the groupby function, and sort it in ascending order on the class column using the sort and asc functions. We also use the agg and sum functions to sum the marks column for rows that fall into the same group.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import asc, sum

# create the Spark session
spark_session = SparkSession.builder.getOrCreate()

# read the CSV file
data_frame = spark_session.read.csv(
    '/content/marks_data.csv', sep=',',
    inferSchema=True, header=True)

# group by class and name, sum the marks, and sort by class in ascending order
data_frame.groupby("class", "name").agg(
    sum("marks").alias("marks")).sort(asc("class")).show()
Output:
Example 2:
In this example, we take the data frame (link), i.e., the 3×6 dataset, group it by the class and name columns using the groupby function, sort it in descending order on the class column using the sort and desc functions, and in ascending order on the marks column using the sort and asc functions. We also use the agg and sum functions to sum the marks column for rows that fall into the same group.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import asc, desc, sum

# create the Spark session
spark_session = SparkSession.builder.getOrCreate()

# read the CSV file
data_frame = spark_session.read.csv(
    '/content/marks_data.csv', sep=',',
    inferSchema=True, header=True)

# group by class and name, sum the marks, then sort by class descending and marks ascending
data_frame.groupby("class", "name").agg(sum("marks").alias(
    "marks")).sort(desc("class"), asc("marks")).show()
Output:
Method 2: Using orderBy() function
In this method, we are going to use the orderBy() function to sort the data frame in PySpark. It returns a new DataFrame sorted by the specified columns; in PySpark, sort() is an alias for orderBy(), so the two behave the same way.
Syntax: DataFrame.orderBy(*cols, ascending=True)
Parameters:
- cols: list of Column objects or column names to order by
- ascending: Boolean, or a list of Booleans (one per column), specifying the sort order of the columns listed in cols
Return type: Returns a new DataFrame sorted by the specified columns.
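A minimal sketch of the ascending parameter, assuming a hypothetical DataFrame df with columns class and marks:
from pyspark.sql.functions import col

# order by class descending and marks ascending using a list of Booleans
df.orderBy(["class", "marks"], ascending=[False, True]).show()

# equivalent form using column expressions
df.orderBy(col("class").desc(), col("marks").asc()).show()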
Stepwise Implementation:
Step 1: First, import the required objects, i.e. SparkSession, sum, desc, and col. SparkSession is used to create the session, sum is used to aggregate the column on which groupby is applied, desc is used to sort a column in descending order, and col is used to return a Column object for the given column name.
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, desc, col
Step 2: Now, create a Spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, read the CSV file which you want to sort within groups.
data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
Step 4: Finally, group the data frame using the groupby function and then arrange the grouped result in ascending or descending order using the orderBy function. The agg function is used to aggregate the column that is summarized within each group.
data_frame.groupby("#column-name").agg(sum("#column-name").alias("#column-name")).orderBy(col("#column-name").desc()).show()
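Since sort() is an alias for orderBy(), the grouped result from Step 4 can be ordered with either call. A brief sketch of this equivalence, assuming a hypothetical DataFrame df with columns class and marks:
from pyspark.sql.functions import col, sum

grouped = df.groupby("class").agg(sum("marks").alias("marks"))

# these two lines produce the same ordering
grouped.orderBy(col("marks").desc()).show()
grouped.sort(col("marks").desc()).show()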
Example 1:
In this example, we take the data frame (link), i.e., the 3×6 dataset, group it by the class and name columns using the groupby function, and sort it in ascending order on the class column using the orderBy function with col().asc(). We also use the agg and sum functions to sum the marks column for rows that fall into the same group.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

# create the Spark session
spark_session = SparkSession.builder.getOrCreate()

# read the CSV file
data_frame = spark_session.read.csv(
    '/content/marks_data.csv',
    sep=',', inferSchema=True, header=True)

# group by class and name, sum the marks, and order by class in ascending order
data_frame.groupby("class", "name").agg(
    sum("marks").alias("marks")).orderBy(col("class").asc()).show()
Output:
Example 2:
In this example, we take the data frame (link), i.e., the 3×6 dataset, group it by the class and name columns using the groupby function, sort it in descending order on the class column using the orderBy function with col().desc(), and in ascending order on the marks column with col().asc(). We also use the agg and sum functions to sum the marks column for rows that fall into the same group.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

# create the Spark session
spark_session = SparkSession.builder.getOrCreate()

# read the CSV file
data_frame = spark_session.read.csv(
    '/content/marks_data.csv',
    sep=',', inferSchema=True, header=True)

# group by class and name, sum the marks, then order by class descending and marks ascending
data_frame.groupby("class", "name").agg(sum("marks").alias("marks")).orderBy(
    col("class").desc(), col("marks").asc()).show()
Output: