Pyspark GroupBy DataFrame with Aggregation or Count

Last Updated : 23 Jun, 2025

Pyspark is a powerful tool for handling large datasets in a distributed environment using Python. One common operation when working with data is grouping it based on one or more columns. This can be easily done in Pyspark using the groupBy() function, which helps to aggregate or count values in each group.

In this article, we will explore how to use the groupBy() function in Pyspark for counting occurrences and performing various aggregation operations.

Syntax of groupBy()

DataFrame.groupBy(*cols)

Parameters:

  • cols: The column(s) to group by. Each argument can be a column name (string) or a Column expression, and several columns (or a list of them) can be passed to group by more than one key.

Returns: a GroupedData object, which exposes aggregation methods such as count(), sum(), avg(), mean(), min(), max() and agg().
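The different call forms can be seen in the short sketch below; it assumes the student dataframe created in the next section and is only meant to illustrate equivalent ways of passing columns.

Python
from pyspark.sql.functions import col

# Equivalent ways to group the student DataFrame by the DEPT column
dataframe.groupBy("DEPT")            # column name as a string
dataframe.groupBy(col("DEPT"))       # Column expression
dataframe.groupBy(["DEPT"])          # list of column names

# Grouping by more than one key at once
dataframe.groupBy("DEPT", "FEE").count().show()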

Creating a Pyspark DataFrame 

Before performing the groupBy() operation, let's create a simple DataFrame containing some student data, including columns like ID, NAME, DEPT, and FEE.

Python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point for DataFrame operations
spark = SparkSession.builder.appName('GroupByExample').getOrCreate()

# Sample student records: ID, name, department and fee paid
data = [
    ["1", "sravan", "IT", 45000],
    ["2", "ojaswi", "CS", 85000],
    ["3", "rohith", "CS", 41000],
    ["4", "sridevi", "IT", 56000],
    ["5", "bobby", "ECE", 45000],
    ["6", "gayatri", "ECE", 49000],
    ["7", "gnanesh", "CS", 45000],
    ["8", "bhanu", "Mech", 21000]
]

columns = ['ID', 'NAME', 'DEPT', 'FEE']

# Build the DataFrame from the rows and the list of column names
dataframe = spark.createDataFrame(data, columns)

dataframe.show()

Output:

Snapshot of the dataframe

Pyspark groupBy with Count

To count the number of rows in each group, call count() on the grouped data. The result contains one row per group, showing how many times each value of the grouping column occurs.

Python
# Grouping by 'DEPT' and counting occurrences
dataframe.groupBy('DEPT').count().show()

Output:

Snapshot of the output

Explanation:

  • groupBy('DEPT'): Groups the data by the DEPT column.
  • count(): Counts the number of rows for each group (department).
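
As a small extension of the same idea (not part of the original example), count() can be combined with multiple grouping columns or with orderBy() to sort departments by their size:

Python
from pyspark.sql.functions import desc

# Group by two columns at once and count each (DEPT, FEE) combination
dataframe.groupBy('DEPT', 'FEE').count().show()

# Sort departments from largest to smallest group
dataframe.groupBy('DEPT').count().orderBy(desc('count')).show()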

Pyspark groupBy with Aggregation

You can apply various aggregation functions to your grouped data, such as sum(), max(), min(), mean(), etc.

Python
from pyspark.sql.functions import sum, max, min, mean, count

# Grouping by 'DEPT' and applying aggregation functions
dataframe.groupBy("DEPT").agg(
    max("FEE"), sum("FEE"),
    min("FEE"), mean("FEE"),
    count("FEE")
).show()

Output:

Snapshot of the output

Explanation:

  • groupBy("DEPT"): Groups the data by the DEPT column.
  • agg(): Applies the aggregation functions (max, sum, min, mean, count) on the FEE column for each group.
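
By default the result columns get names like max(FEE) and sum(FEE). If you prefer readable names, each aggregation can be renamed with alias(); the sketch below shows one way to do it (the output column names are chosen purely for illustration):

Python
from pyspark.sql.functions import sum, max, min, avg, count

# Same aggregations as above, but with explicit output column names
dataframe.groupBy("DEPT").agg(
    max("FEE").alias("max_fee"),
    sum("FEE").alias("total_fee"),
    min("FEE").alias("min_fee"),
    avg("FEE").alias("avg_fee"),
    count("FEE").alias("num_students")
).show()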
