Remove duplicates from a dataframe in PySpark

In this article, we will remove duplicate rows from a DataFrame using PySpark in Python.

Before starting, we create a DataFrame for demonstration:

Python3

# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data (rows 5 and 6 repeat rows 1 and 4)
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

print('Actual data in dataframe')
dataframe.show()

Output:

Method 1: Using distinct() method

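distinct() returns a new DataFrame with the duplicate rows removed. Two rows count as duplicates only when they match in every column, and the original dataframe is left unchanged.

Syntax: dataframe.distinct()

where dataframe is the input PySpark DataFrame (here, the one created above from the nested lists)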

Example 1: Python program to drop duplicate data using distinct() function

Python3

print('distinct data after dropping duplicate rows')

# display distinct data
dataframe.distinct().show()

Output:

Example 2: Python program to select distinct data in only two columns.

We can use the select() function together with distinct() to get the distinct combinations of values from particular columns. Only the selected columns appear in the result.

Syntax: dataframe.select(['column 1','column n']).distinct().show()

Python3

# display distinct data in
# Employee ID and Employee NAME 
dataframe.select(['Employee ID',
                  'Employee NAME']).distinct().show()

Output:

Method 2: Using dropDuplicates() method

Syntax: dataframe.dropDuplicates()

where dataframe is the input PySpark DataFrame. Called with no arguments, dropDuplicates() compares entire rows and returns the same result as distinct().

Example 1: Python program to remove duplicate data from the employee table.

Python3

# remove duplicate rows
# using the dropDuplicates() function
dataframe.dropDuplicates().show()

Output:

Example 2: Python program to remove duplicate values in specific columns

Python3

# keep only two columns, then remove
# duplicate rows from that selection
# using the dropDuplicates() function
dataframe.select(['Employee ID',
                  'Employee NAME']).dropDuplicates().show()

Output:
