
Dealing with Duplicate Data Points in an SQL Query

Last Updated : 24 Oct, 2025

Dealing with duplicate data points in an SQL query involves identifying and managing these duplicates to ensure data integrity. The process typically includes:

  • Identifying Duplicates: Use the GROUP BY clause on the columns you want to check for duplicates and the HAVING clause with the COUNT function to filter groups with more than one occurrence.
  • Listing Duplicate Records: Join the original table with the result of the duplicate identification query to list all duplicate records.
  • Removing Duplicates: Use DELETE with a ROW_NUMBER() window function to retain only one instance of each duplicate.
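
The three steps above can be sketched end to end with Python's built-in sqlite3 module; the table and column names below are illustrative, and the removal step uses a `MIN(id)` subquery, which works even on databases that lack window functions.

```python
import sqlite3

# Illustrative in-memory table containing one duplicated customer row
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
con.executemany("INSERT INTO customers (name, city) VALUES (?, ?)",
                [("John Doe", "New York"), ("Jane Smith", "Chicago"),
                 ("John Doe", "New York")])

# Step 1: identify duplicate (name, city) groups
dupes = con.execute("""
    SELECT name, city, COUNT(*) FROM customers
    GROUP BY name, city HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [('John Doe', 'New York', 2)]

# Step 3: keep the lowest id in each group and delete the rest
con.execute("""
    DELETE FROM customers WHERE id NOT IN (
        SELECT MIN(id) FROM customers GROUP BY name, city
    )
""")
remaining = con.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(remaining)  # 2
```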

Understanding Duplicate Data

Duplicate data refers to records in a database that are identical, or nearly identical, in one or more columns. Such records can be created by data-entry mistakes, by merging multiple data sources, or by flawed data-processing practices. If not managed well, duplicate data can cause a range of problems, such as redundant information, incorrect query results, and slower data management.

Examples of Duplicate Data:

  • In a customer database, the same customer listed more than once.
  • In a sales database, the same transaction between a buyer and a seller recorded multiple times.
  • In an inventory database, copied or repeated product listings.

Methods for Handling Duplicate Data in SQL Queries

SQL offers several ways to identify and remove duplicate data. The most common techniques include:

  1. Using the DISTINCT Keyword
  2. Using GROUP BY and HAVING Clauses
  3. Using ROW_NUMBER() with Common Table Expressions (CTEs)
  4. Using the DELETE Statement

1. Using the DISTINCT Keyword

The DISTINCT keyword in SQL removes duplicate rows from a query's result set, returning only unique combinations of the selected columns.

Syntax:

SELECT DISTINCT column1, column2, ...
FROM table_name;
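
A minimal demonstration of DISTINCT, using Python's sqlite3 module with an illustrative `orders` table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, city TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("John", "New York"), ("John", "New York"), ("Jane", "Chicago")])

# DISTINCT collapses identical result rows into one
# (sorted here because SQL gives no order guarantee without ORDER BY)
rows = sorted(con.execute("SELECT DISTINCT customer, city FROM orders").fetchall())
print(rows)  # [('Jane', 'Chicago'), ('John', 'New York')]
```

Note that DISTINCT applies to the whole selected row: two rows are duplicates only if every selected column matches.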

2. Using GROUP BY and HAVING Clauses

The GROUP BY clause groups rows that share the same values in the specified columns. Used together with aggregate functions such as COUNT, the HAVING clause then filters those groups by a condition, such as a count greater than one, which exposes the duplicated values.

Syntax:

SELECT column1, COUNT(*)
FROM table_name
GROUP BY column1
HAVING COUNT(*) > 1;
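
The same pattern run against an illustrative `signups` table (sqlite3 sketch):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE signups (email TEXT)")
con.executemany("INSERT INTO signups VALUES (?)",
                [("a@x.com",), ("b@x.com",), ("a@x.com",), ("a@x.com",)])

# Groups with COUNT(*) > 1 are exactly the duplicated values
dupes = con.execute("""
    SELECT email, COUNT(*) AS n FROM signups
    GROUP BY email HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [('a@x.com', 3)]
```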

3. Using ROW_NUMBER() with Common Table Expressions (CTEs)

The ROW_NUMBER() function assigns a sequential integer to each row within a partition of an ordered result set. Combined with a CTE, it makes it straightforward to mark which row in each group of duplicates to keep and which to remove.

Syntax:

WITH CTE AS (
    SELECT column1, column2,
           ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY column3) AS row_num
    FROM table_name
)
SELECT * FROM CTE
WHERE row_num = 1;
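
A runnable sketch of the same idea with sqlite3 (window functions require SQLite 3.25 or newer); the `products` table and its columns are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (id INTEGER, sku TEXT)")
con.executemany("INSERT INTO products VALUES (?, ?)",
                [(1, "A-100"), (2, "A-100"), (3, "B-200")])

# Number rows within each sku group; row_num = 1 marks the row to keep
rows = con.execute("""
    WITH cte AS (
        SELECT id, sku,
               ROW_NUMBER() OVER (PARTITION BY sku ORDER BY id) AS row_num
        FROM products
    )
    SELECT id, sku FROM cte WHERE row_num = 1 ORDER BY id
""").fetchall()
print(rows)  # [(1, 'A-100'), (3, 'B-200')]
```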

4. Using the DELETE Statement

The DELETE statement removes rows from a table that match specified conditions. Combined with a CTE and ROW_NUMBER(), it can delete every duplicate while retaining one instance of each. Note that deleting through a CTE in this way is supported in SQL Server; most other databases require expressing the same logic with a subquery.

Syntax:

WITH CTE AS (
    SELECT column1, column2,
           ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY column3) AS row_num
    FROM table_name
)
DELETE FROM CTE
WHERE row_num > 1;

Example: Removing Duplicate Customer Records

Suppose a Customers table in the database contains rows with the same customer information. To remove these duplicates:

  • Customers Table:

CustomerID | CustomerName | City
---------- | ------------ | --------
1          | John Doe     | New York
2          | Jane Smith   | Chicago
3          | John Doe     | New York
4          | Jane Smith   | Chicago

  • SQL Query:

WITH CTE AS (
    SELECT CustomerID, CustomerName, City,
           ROW_NUMBER() OVER (PARTITION BY CustomerName, City ORDER BY CustomerID) AS row_num
    FROM Customers
)
DELETE FROM CTE
WHERE row_num > 1;
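
The same cleanup, runnable with sqlite3. SQLite cannot delete through a CTE the way SQL Server can, so the ROW_NUMBER() logic moves into a subquery that selects the IDs to drop:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Customers (CustomerID INTEGER, CustomerName TEXT, City TEXT)")
con.executemany("INSERT INTO Customers VALUES (?, ?, ?)",
                [(1, "John Doe", "New York"), (2, "Jane Smith", "Chicago"),
                 (3, "John Doe", "New York"), (4, "Jane Smith", "Chicago")])

# Delete every row that is not the first (lowest CustomerID) in its group
con.execute("""
    DELETE FROM Customers WHERE CustomerID IN (
        SELECT CustomerID FROM (
            SELECT CustomerID,
                   ROW_NUMBER() OVER (PARTITION BY CustomerName, City
                                      ORDER BY CustomerID) AS row_num
            FROM Customers
        ) WHERE row_num > 1
    )
""")
rows = con.execute(
    "SELECT CustomerID, CustomerName, City FROM Customers ORDER BY CustomerID"
).fetchall()
print(rows)  # [(1, 'John Doe', 'New York'), (2, 'Jane Smith', 'Chicago')]
```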

Preventing Duplicate Records

Preventing duplicates is better than dealing with them after they occur. Here are some strategies to prevent duplicates:

1. Using Constraints

SQL constraints such as PRIMARY KEY and UNIQUE can enforce uniqueness in a table.

ALTER TABLE table_name
ADD CONSTRAINT unique_constraint UNIQUE (column1, column2);

This constraint ensures that the combination of column1 and column2 is unique across the table.
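
A quick sqlite3 sketch of a UNIQUE constraint in action (here declared at table creation; the table name is illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# UNIQUE on (name, city) rejects any duplicate combination at insert time
con.execute("CREATE TABLE customers (name TEXT, city TEXT, UNIQUE (name, city))")
con.execute("INSERT INTO customers VALUES ('John Doe', 'New York')")

try:
    con.execute("INSERT INTO customers VALUES ('John Doe', 'New York')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```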

2. Using Triggers

Triggers can check for duplicates before data is inserted or updated. The example below uses MySQL syntax:

CREATE TRIGGER prevent_duplicates
BEFORE INSERT ON table_name
FOR EACH ROW
BEGIN
    IF EXISTS (SELECT 1 FROM table_name
               WHERE column1 = NEW.column1 AND column2 = NEW.column2) THEN
        SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Duplicate entry';
    END IF;
END;

This trigger prevents the insertion of duplicate rows by raising an error if a duplicate is detected.
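
SQLite has no SIGNAL statement, but the same guard can be written with its RAISE(ABORT, ...) function. A runnable sqlite3 sketch with illustrative names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (column1 TEXT, column2 TEXT)")

# SQLite's counterpart to MySQL's SIGNAL is RAISE(ABORT, message)
con.execute("""
    CREATE TRIGGER prevent_duplicates
    BEFORE INSERT ON t
    FOR EACH ROW
    WHEN EXISTS (SELECT 1 FROM t
                 WHERE column1 = NEW.column1 AND column2 = NEW.column2)
    BEGIN
        SELECT RAISE(ABORT, 'Duplicate entry');
    END
""")
con.execute("INSERT INTO t VALUES ('a', 'b')")

try:
    con.execute("INSERT INTO t VALUES ('a', 'b')")
    blocked, msg = False, ""
except sqlite3.IntegrityError as e:
    blocked, msg = True, str(e)
print(blocked, msg)
```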

Handling Duplicates in Data Merging

When merging data from multiple sources, duplicates are often inevitable. Here are some strategies to handle them:

1. Standardizing Data

Standardizing data involves cleaning and transforming data to ensure consistency. This can include removing leading/trailing spaces, converting to a common case (e.g., all lowercase) and normalizing formats (e.g., date formats).

UPDATE table_name
SET column1 = TRIM(LOWER(column1));
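
Standardization in action with sqlite3 (the `leads` table is illustrative): once case and whitespace are normalized, rows that previously looked different become detectable duplicates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE leads (email TEXT)")
con.executemany("INSERT INTO leads VALUES (?)",
                [("  Ann@X.com ",), ("ann@x.com",)])

# Normalize case and trim whitespace so duplicates can be matched
con.execute("UPDATE leads SET email = TRIM(LOWER(email))")
rows = con.execute("SELECT DISTINCT email FROM leads").fetchall()
print(rows)  # [('ann@x.com',)]
```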

2. Using UNION and UNION ALL

The UNION operator combines the results of two queries and removes duplicates, while UNION ALL includes duplicates.

SELECT column1, column2
FROM table1
UNION
SELECT column1, column2
FROM table2;

This query combines the results from table1 and table2 and removes duplicates.
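
The difference between the two operators, shown with sqlite3 and two small illustrative tables sharing one overlapping value:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t1 (name TEXT)")
con.execute("CREATE TABLE t2 (name TEXT)")
con.executemany("INSERT INTO t1 VALUES (?)", [("Ann",), ("Bob",)])
con.executemany("INSERT INTO t2 VALUES (?)", [("Bob",), ("Cid",)])

# UNION removes duplicates across both inputs; UNION ALL keeps every row
union = con.execute("SELECT name FROM t1 UNION SELECT name FROM t2").fetchall()
union_all = con.execute("SELECT name FROM t1 UNION ALL SELECT name FROM t2").fetchall()
print(len(union), len(union_all))  # 3 4
```

UNION ALL is faster because it skips the deduplication step, so prefer it when the inputs are already known to be disjoint.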

