Count rows based on condition in Pyspark Dataframe
Last Updated: 29 Jun, 2021
In this article, we will discuss how to count rows based on conditions in a PySpark dataframe.
For this, we are going to use these methods:
- Using the where() function.
- Using the filter() function.
Creating Dataframe for demonstration:
Python3
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of students data
data =[["1","sravan","vignan"],
["2","ojaswi","vvit"],
["3","rohith","vvit"],
["4","sridevi","vignan"],
["1","sravan","vignan"],
["5","gnanesh","iit"]]
# specify column names
columns = ['ID','NAME','college']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data,columns)
print('Actual data in dataframe')
dataframe.show()
Output:
Actual data in dataframe
+---+-------+-------+
| ID|   NAME|college|
+---+-------+-------+
|  1| sravan| vignan|
|  2| ojaswi|   vvit|
|  3| rohith|   vvit|
|  4|sridevi| vignan|
|  1| sravan| vignan|
|  5|gnanesh|    iit|
+---+-------+-------+
Note: To get the total row count, we can use the count() function.
Syntax: dataframe.count()
where dataframe is the input PySpark dataframe.
Example: Python program to get the total row count
Python3
print('Total rows in dataframe')
dataframe.count()
Output:
Total rows in dataframe
6
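As an aside, the same total can also be obtained through Spark SQL by first registering the dataframe as a temporary view. This is a minimal sketch of that alternative; the view name students is our own choice, not part of the original code:
Python3
# register the dataframe as a temporary view
# (the view name 'students' is a hypothetical choice)
dataframe.createOrReplaceTempView('students')

# run a COUNT(*) query over the view and pull
# the single value out of the one-row result
print(spark.sql('SELECT COUNT(*) FROM students').collect()[0][0])
This prints 6, matching count() above.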
Method 1: Using where()
where(): This function checks the given condition and returns only the rows that satisfy it.
Syntax: dataframe.where(condition)
where condition is a boolean expression over the dataframe's columns.
Example 1: Condition to get rows in the dataframe where ID = 1
Python3
# condition to get rows in dataframe
# where ID = 1
print('Total rows in dataframe where '
      'ID = 1 with where clause')
print(dataframe.where(dataframe.ID == '1').count())

print('They are')
dataframe.where(dataframe.ID == '1').show()
Output:
Total rows in dataframe where ID = 1 with where clause
2
They are
+---+------+-------+
| ID|  NAME|college|
+---+------+-------+
|  1|sravan| vignan|
|  1|sravan| vignan|
+---+------+-------+
Example 2: Counting rows in the dataframe under several different conditions.
Python3
# condition to get rows in dataframe
# where ID is not equal to 1
print('Total rows in dataframe where '
      'ID not equal to 1 with where clause')
print(dataframe.where(dataframe.ID != '1').count())

# condition to get rows in dataframe
# where college is equal to vignan
print('Total rows in dataframe where '
      'college is vignan with where clause')
print(dataframe.where(dataframe.college == 'vignan').count())

# condition to get rows in dataframe where ID is
# greater than 2 (ID is stored as a string,
# so we compare against the string '2')
print('Total rows in dataframe where ID greater '
      'than 2 with where clause')
print(dataframe.where(dataframe.ID > '2').count())
Output:
Total rows in dataframe where ID not equal to 1 with where clause
4
Total rows in dataframe where college is vignan with where clause
3
Total rows in dataframe where ID greater than 2 with where clause
3
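As a side note, the != test above can also be written with PySpark's ~ (logical NOT) operator, which the article otherwise does not use; a quick sketch on the same dataframe:
Python3
# same count as dataframe.ID != '1', written by
# negating an equality test with the ~ operator
print(dataframe.where(~(dataframe.ID == '1')).count())
This also prints 4.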
Example 3: Python program for multiple conditions
Python3
# condition to get rows in dataframe where
# ID is not equal to 1 and name is sridevi
print('Total rows in dataframe where ID '
      'not equal to 1 and name is sridevi')
print(dataframe.where((dataframe.ID != '1') &
                      (dataframe.NAME == 'sridevi')).count())

# condition to get rows in dataframe
# where college is equal to vignan or iit
print('Total rows in dataframe where college is '
      'vignan or iit with where clause')
print(dataframe.where((dataframe.college == 'vignan') |
                      (dataframe.college == 'iit')).count())
Output:
Total rows in dataframe where ID not equal to 1 and name is sridevi
1
Total rows in dataframe where college is vignan or iit with where clause
4
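All the conditions so far reference columns through attribute access (dataframe.ID). where() equally accepts conditions built with the col() helper from pyspark.sql.functions, or plain SQL expression strings; a brief sketch, reusing the dataframe created above:
Python3
# importing the col() helper
from pyspark.sql.functions import col

# same count as dataframe.ID != '1', using col()
print(dataframe.where(col('ID') != '1').count())

# same vignan-or-iit count, written as an SQL expression string
print(dataframe.where("college = 'vignan' OR college = 'iit'").count())
Both forms give the same counts as above (4 and 4).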
Method 2: Using filter()
filter(): This function also checks the given condition and returns the matching rows. It behaves exactly like where(); in PySpark, where() is simply an alias for filter().
Syntax: dataframe.filter(condition)
Example 1: Python program to get rows where ID = 1
Python3
# condition to get rows in
# dataframe where ID = 1
print('Total rows in dataframe where '
      'ID = 1 with filter clause')
print(dataframe.filter(dataframe.ID == '1').count())

print('They are')
dataframe.filter(dataframe.ID == '1').show()
Output:
Total rows in dataframe where ID = 1 with filter clause
2
They are
+---+------+-------+
| ID|  NAME|college|
+---+------+-------+
|  1|sravan| vignan|
|  1|sravan| vignan|
+---+------+-------+
Example 2: Python program for multiple conditions
Python3
# condition to get rows in dataframe where
# ID is not equal to 1 and name is sridevi
print('Total rows in dataframe where ID not '
      'equal to 1 and name is sridevi')
print(dataframe.filter((dataframe.ID != '1') &
                       (dataframe.NAME == 'sridevi')).count())

# condition to get rows in dataframe
# where college is equal to vignan or iit
print('Total rows in dataframe where college '
      'is vignan or iit with filter clause')
print(dataframe.filter((dataframe.college == 'vignan') |
                       (dataframe.college == 'iit')).count())
Output:
Total rows in dataframe where ID not equal to 1 and name is sridevi
1
Total rows in dataframe where college is vignan or iit with filter clause
4
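Finally, note that each count() above triggers a separate Spark job. When several conditional counts are needed at once, they can be computed in a single pass with an aggregation built from sum() and when(); this is a sketch of that pattern on the same dataframe, not part of the original article:
Python3
# importing the functions module under an alias
import pyspark.sql.functions as F

# F.when(cond, 1) yields 1 for matching rows (null otherwise),
# and F.sum() adds those 1s up, so each aggregate below
# is a conditional row count computed in one job
dataframe.agg(
    F.sum(F.when(dataframe.ID == '1', 1)).alias('id_is_1'),
    F.sum(F.when(dataframe.college == 'vignan', 1)).alias('vignan')
).show()
On the demo data this prints 2 for id_is_1 and 3 for vignan.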