In this article, we will drop duplicate rows from a DataFrame using PySpark in Python.
Before starting, we will create a DataFrame for demonstration:
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]
# specify column names
columns = ['Employee ID','Employee NAME','Company']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data,columns)
print('Actual data in dataframe')
dataframe.show()
Output:
Method 1: Using distinct() method
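distinct() returns a new DataFrame with duplicate rows removed. Two rows count as duplicates only when they match in every column. The original DataFrame is not modified.
Syntax: dataframe.distinct()
where dataframe is the input DataFrame (here, the one created above from the nested lists).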
Example 1: Python program to drop duplicate data using distinct() function
print('distinct data after dropping duplicate rows')
# display distinct data
dataframe.distinct().show()
Output:
Example 2: Python program to select distinct data in only two columns.
We can combine select() with distinct() to get the distinct combinations of values from particular columns. The result contains only the selected columns.
Syntax: dataframe.select(['column 1','column n']).distinct().show()
# display distinct data in
# Employee ID and Employee NAME
dataframe.select(['Employee ID',
                  'Employee NAME']).distinct().show()
Output:

Method 2: Using dropDuplicates() method
Syntax: dataframe.dropDuplicates()
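where dataframe is the input DataFrame (here, the one created above from the nested lists). Called without arguments, dropDuplicates() compares all columns, so it returns the same rows as distinct().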
Example 1: Python program to remove duplicate data from the employee table.
# remove duplicate rows
# using the dropDuplicates() function
dataframe.dropDuplicates().show()
Output:
Example 2: Python program to remove duplicate rows after selecting two specific columns
# remove duplicate rows
# using the dropDuplicates() function
# on two selected columns
dataframe.select(['Employee ID',
                  'Employee NAME']).dropDuplicates().show()
Output: