Find duplicate rows in a Dataframe based on all or selected columns
Last Updated :
04 Dec, 2023
Finding duplicate rows in a DataFrame means identifying rows whose values repeat those of another row, either across all columns or across a chosen subset of columns. Detecting such rows is a common data-cleaning step before further analysis. In this article, we discuss how to find duplicate rows in a DataFrame based on all columns or on a list of columns, using the DataFrame.duplicated() method of Pandas.
Creating a Sample Pandas DataFrame
Let's create a simple DataFrame from a list of tuples, with the column names 'Name', 'Age', and 'City'.
Python3
# Import pandas library
import pandas as pd

# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]

# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])

# Print the Dataframe
df
Output
Name Age City
0 Stuti 28 Varanasi
1 Saumya 32 Delhi
2 Aaditya 25 Mumbai
3 Saumya 32 Delhi
4 Saumya 32 Delhi
5 Saumya 32 Mumbai
6 Aaditya 40 Dehradun
7 Seema 32 Delhi
Find All Duplicate Rows in a Pandas Dataframe
Below are the examples by which we can select duplicate rows in a DataFrame:
- Select Duplicate Rows Based on All Columns
- Get List of Duplicate Last Rows Based on All Columns
- Select List Of Duplicate Rows Using Single Columns
- Select List Of Duplicate Rows Using Multiple Columns
- Select Duplicate Rows Using Sort Values
Select Duplicate Rows Based on All Columns
Here, we do not pass any arguments, so duplicated() uses its default values: subset = None (compare all columns) and keep = 'first' (mark every occurrence except the first as a duplicate).
Python3
# Import pandas library
import pandas as pd

# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]

# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])

# Selecting duplicate rows except first
# occurrence based on all columns
duplicate = df[df.duplicated()]
print("Duplicate Rows :")

# Print the resultant Dataframe
duplicate
Output
Duplicate Rows :
Name Age City
3 Saumya 32 Delhi
4 Saumya 32 Delhi
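To make the filtering step above more concrete, the following minimal sketch prints the boolean mask that duplicated() returns before it is used to index the DataFrame. The variable names mask and n_duplicates are illustrative and not part of the original example.

```python
import pandas as pd

employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# duplicated() returns a boolean Series aligned with df's index:
# True marks a row that repeats an earlier row (default keep='first')
mask = df.duplicated()
print(mask)

# True counts as 1, so sum() gives the number of flagged rows
n_duplicates = int(mask.sum())
print("Number of duplicate rows:", n_duplicates)
```

Indexing with df[mask] then keeps only the rows where the mask is True, which is what df[df.duplicated()] does in a single step.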
Get List of Duplicate Last Rows Based on All Columns
If you want to mark every occurrence of a duplicated row except the last one, pass keep = 'last' as an argument.
Python3
# Import pandas library
import pandas as pd

# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]

# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])

# Selecting duplicate rows except last
# occurrence based on all columns.
duplicate = df[df.duplicated(keep='last')]
print("Duplicate Rows :")

# Print the resultant Dataframe
duplicate
Output
Duplicate Rows :
Name Age City
1 Saumya 32 Delhi
3 Saumya 32 Delhi
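The effect of the keep argument is easiest to see side by side. The sketch below, which reuses the same sample data, collects the index labels flagged under each of the three accepted values ('first', 'last', and False). The variable names first_idx, last_idx, and all_idx are illustrative.

```python
import pandas as pd

employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# Rows 1, 3 and 4 are identical across all columns.
# keep='first' leaves the first occurrence (row 1) unflagged
first_idx = df.index[df.duplicated(keep='first')].tolist()

# keep='last' leaves the last occurrence (row 4) unflagged
last_idx = df.index[df.duplicated(keep='last')].tolist()

# keep=False flags every member of a duplicated group
all_idx = df.index[df.duplicated(keep=False)].tolist()

print("keep='first':", first_idx)  # [3, 4]
print("keep='last': ", last_idx)   # [1, 3]
print("keep=False:  ", all_idx)    # [1, 3, 4]
```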
Select List Of Duplicate Rows Using Single Columns
If you want to select duplicate rows based on only some columns, pass a column name or a list of column names as the subset argument. Here a single column, 'City', is used.
Python3
# import pandas library
import pandas as pd

# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]

# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])

# Selecting duplicate rows based
# on 'City' column
duplicate = df[df.duplicated('City')]
print("Duplicate Rows based on City :")

# Print the resultant Dataframe
duplicate
Output
Duplicate Rows based on City :
Name Age City
3 Saumya 32 Delhi
4 Saumya 32 Delhi
5 Saumya 32 Mumbai
7 Seema 32 Delhi
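In the example above the column is passed positionally. As a small sketch of the same call written with the subset keyword, the code below also checks that a single column name and a one-element list behave identically. The names by_string and by_list are illustrative.

```python
import pandas as pd

employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# subset accepts either a single column label or a list of labels
by_string = df.duplicated(subset='City')
by_list = df.duplicated(subset=['City'])

# Both forms produce the same boolean mask
print(by_string.equals(by_list))

# Only 'City' is compared, so row 7 (Seema) is flagged because
# 'Delhi' already appeared in row 1, even though the name differs
print(df[by_string])
```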
Select List Of Duplicate Rows Using Multiple Columns
In this example, a pandas DataFrame is created from a list of employee tuples with the columns 'Name', 'Age', and 'City'. The code identifies and displays duplicate rows based on the 'Name' and 'Age' columns, so a row is flagged when an earlier row has the same name and age, regardless of city.
Python3
# import pandas library
import pandas as pd

# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]

# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])

# Selecting duplicate rows based
# on list of column names
duplicate = df[df.duplicated(['Name', 'Age'])]
print("Duplicate Rows based on Name and Age :")

# Print the resultant Dataframe
duplicate
Output
Duplicate Rows based on Name and Age :
Name Age City
3 Saumya 32 Delhi
4 Saumya 32 Delhi
5 Saumya 32 Mumbai
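Besides listing the duplicate rows, it can help to know how many times each 'Name' and 'Age' combination occurs. One possible way to do this, shown here as a sketch rather than as part of the original example, is to group on the same columns and count the group sizes. The names group_sizes and repeated are illustrative.

```python
import pandas as pd

employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# Count the rows in each (Name, Age) group
group_sizes = df.groupby(['Name', 'Age']).size()

# Keep only combinations that occur more than once
repeated = group_sizes[group_sizes > 1]
print(repeated)
```

A group of size n contributes n - 1 rows to df.duplicated(['Name', 'Age']) with the default keep='first', which matches the three rows listed above for the group of four.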
Select Duplicate Rows Using Sort Values
In this example, a pandas DataFrame is created from a list of employee tuples, and duplicate rows based on the 'Name' and 'Age' columns are identified with keep=False, so every member of a duplicated group is included. The resulting DataFrame is then sorted by the 'Age' column, which shows how to find and organize duplicate entries in a tabular data structure.
Python3
import pandas as pd

# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]

# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])

# Finding and sorting duplicate rows based on 'Name' and 'Age'
duplicate_sorted = df[df.duplicated(['Name', 'Age'], keep=False)].sort_values('Age')
print("Duplicate Rows based on Name and Age (sorted):")

# Print the resultant DataFrame
print(duplicate_sorted)
Output
Duplicate Rows based on Name and Age (sorted):
Name Age City
1 Saumya 32 Delhi
3 Saumya 32 Delhi
4 Saumya 32 Delhi
5 Saumya 32 Mumbai
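With the sample data only one group is duplicated, so the sort has little visible effect. The sketch below adds one hypothetical extra row, ('Aaditya', 25, 'Pune'), which is not in the original data, to create a second duplicated group. It then sorts by both 'Name' and 'Age' so that the members of each group sit next to each other.

```python
import pandas as pd

employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi'),
             # Hypothetical extra row, added only for this illustration
             ('Aaditya', 25, 'Pune')]
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# keep=False returns every member of each duplicated (Name, Age) group
dupes = df[df.duplicated(['Name', 'Age'], keep=False)]

# Sorting by the same columns places each group's rows together
dupes_sorted = dupes.sort_values(['Name', 'Age'])
print(dupes_sorted)
```

Sorting on the same columns that were passed as the subset is a simple way to review each group of duplicates as a block.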