How Do You Get Rid of Duplicates in an SQL JOIN?
learnsql.com/blog/get-rid-of-duplicates-sql-join/
Kateryna Koidan
Do you have unwanted duplicates from your SQL JOIN query? In this article, I’ll discuss the possible reasons for getting duplicates
after joining tables in SQL and show how to fix a query depending on the reason behind the duplicates.
Data analysts with little experience in SQL JOINs often encounter unwanted duplicates in the result set. It’s challenging for
beginners to identify the reason behind these duplicates in JOINs.
The best way to learn SQL JOINs is through practice. I recommend the interactive SQL JOINs course. It contains over 90
exercises that make you practice the different JOIN types in SQL.
In this article, I’ll discuss the most common issues leading to duplicates in SQL JOIN outputs. I’ll also show possible solutions to
these common issues.
For example, let’s say you have a list of the top 100 movies from the 20th century, and you want to subset it to the movies made by
currently living directors. In your movies table, you don’t have detailed information on the movie directors, just their IDs. But you
do have a separate directors table, with the ID, the full name, the birth year, and the death year (if applicable) of each director.
In your query, you can join the two tables on the director’s ID to get a list of movies made by currently living directors:
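A minimal sketch of such a query follows. The column names (title, director_id, full_name, death_year) and the placeholder rows are assumptions for illustration; the CTEs stand in for real tables and use PostgreSQL/SQLite-style VALUES syntax.

```sql
-- Placeholder data standing in for the real movies and directors tables.
WITH movies(id, title, director_id) AS (
  VALUES (1, 'Movie A', 1),
         (2, 'Movie B', 2)
),
directors(id, full_name, birth_year, death_year) AS (
  VALUES (1, 'Director A', 1900, 1980),   -- deceased
         (2, 'Director B', 1940, NULL)    -- no death year recorded
)
SELECT movies.title, directors.full_name
FROM movies
JOIN directors
  ON movies.director_id = directors.id    -- match each movie to its director
WHERE directors.death_year IS NULL;       -- keep living directors only
```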
As you can see, we specify the tables we want to join in the FROM and JOIN clauses. Then in the ON clause, we specify the
columns from each table to be used for joining these tables. If you are new to SQL JOINs, check out this introductory guide. Here’s
also an SQL JOIN cheat sheet with syntax and examples of different JOINs.
The SQL JOIN is a great tool that provides a variety of options beyond the simple join of two tables. If you are not familiar with
SQL JOIN types, read this article that explains them with illustrations and examples. Depending on your use case, you can choose
INNER JOIN, LEFT JOIN, RIGHT JOIN, or FULL JOIN. You may even need to join tables without a common column or join
more than two tables.
Now, let’s see how these different JOINs may result in unwanted duplicates.
Let’s start by briefly reviewing the data to be used for our examples. Imagine we run a real estate agency that sells houses
somewhere in the United States. We have three tables: agents, customers, and sales. See below for what data is stored in
each table.
agents
id  first_name  last_name  experience_years
1   Kate        White      5
2   Melissa     Brown      2
3   Alexandr    McGregor   3
4   Sophia      Scott      3
5   Steven      Black      1
6   Maria       Scott      1
customers
sales
1. Missing ON Condition
Beginners unfamiliar with SQL JOINs often simply list the tables in FROM without specifying the JOIN condition at all when trying
to combine information from two or more tables. This is valid syntax, so you do not get any error messages. But the result is a
cross join with all rows from one table combined with all rows from another table.
For example, suppose we want to get information on the customer who bought a particular house (ID #2134). If we use the
following query:
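A sketch of what such a query might look like is below. The sales and customers columns and the sample rows are assumptions made for illustration (only house #2134 and the customer Ivan Lee come from this article); the CTEs stand in for the real tables.

```sql
-- Hypothetical sample data; replace the CTEs with your real tables.
WITH customers(id, first_name, last_name) AS (
  VALUES (1, 'Ivan', 'Lee'),
         (2, 'Anna', 'Novak'),
         (3, 'Omar', 'Hill')
),
sales(house_id, sale_date, agent_first_name, agent_last_name, customer_id) AS (
  VALUES (1015, '2021-10-05', 'Sophia',   'Scott',    1),
         (2134, '2021-10-12', 'Maria',    'Scott',    2),
         (3021, '2021-10-20', 'Alexandr', 'McGregor', 1),
         (3022, '2021-10-20', 'Alexandr', 'McGregor', 3)
)
SELECT s.house_id, c.first_name, c.last_name
FROM sales s, customers c        -- no join condition: this is a cross join
WHERE s.house_id = 2134;
```

With the three sample customers above, this returns three rows, one per customer, all paired with house #2134.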
Instead of one record with the customer we want, we have all our customers listed in the result set.
To fix the query, use the explicit JOIN syntax. The tables to be combined are specified in FROM and JOIN, and the join
condition is specified in the ON clause:
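Under the same assumptions (hypothetical column names such as customer_id and made-up sample rows), the corrected query could be sketched like this:

```sql
-- Hypothetical sample data; replace the CTEs with your real tables.
WITH customers(id, first_name, last_name) AS (
  VALUES (1, 'Ivan', 'Lee'),
         (2, 'Anna', 'Novak'),
         (3, 'Omar', 'Hill')
),
sales(house_id, sale_date, agent_first_name, agent_last_name, customer_id) AS (
  VALUES (1015, '2021-10-05', 'Sophia',   'Scott',    1),
         (2134, '2021-10-12', 'Maria',    'Scott',    2),
         (3021, '2021-10-20', 'Alexandr', 'McGregor', 1),
         (3022, '2021-10-20', 'Alexandr', 'McGregor', 3)
)
SELECT s.house_id, c.first_name, c.last_name
FROM sales s
JOIN customers c
  ON s.customer_id = c.id        -- join condition ties each sale to its buyer
WHERE s.house_id = 2134;
```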
Here, we specify the customer ID from the sales table to match the customer ID from the customers table. This gives us the
desired result:
You could get the same result by putting the join condition in the WHERE clause, but that goes against the intended use of
WHERE, which is filtering rows. Also, there are additional benefits from using the JOIN syntax rather than listing the tables in FROM. Check out
this article to understand why the JOIN syntax is preferred.
2. Using an Incomplete ON Condition
Let’s say we want to see the experience level of the real estate agent for every house sold. Suppose we start by joining the sales and
agents tables by the agent’s last name:
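One way this join might be written is sketched below. The agents rows come from the table shown earlier; the sales columns (agent_first_name, agent_last_name, and so on) and the sales rows are assumptions for illustration.

```sql
WITH agents(id, first_name, last_name, experience_years) AS (
  VALUES (1, 'Kate',     'White',    5),
         (2, 'Melissa',  'Brown',    2),
         (3, 'Alexandr', 'McGregor', 3),
         (4, 'Sophia',   'Scott',    3),
         (5, 'Steven',   'Black',    1),
         (6, 'Maria',    'Scott',    1)
),
-- Hypothetical sales rows
sales(house_id, sale_date, agent_first_name, agent_last_name, customer_id) AS (
  VALUES (1015, '2021-10-05', 'Sophia',   'Scott',    1),
         (2134, '2021-10-12', 'Maria',    'Scott',    2),
         (3021, '2021-10-20', 'Alexandr', 'McGregor', 1),
         (3022, '2021-10-20', 'Alexandr', 'McGregor', 3)
)
SELECT s.house_id, a.first_name, a.last_name, a.experience_years
FROM sales s
JOIN agents a
  ON s.agent_last_name = a.last_name   -- last name alone does not identify an agent
ORDER BY s.house_id;
```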
house_id first_name last_name experience_years
That didn’t work well. We have two different agents with the last name Scott: Maria and Sophia. As a result, houses #1015 and
#2134 are each included twice with different agents.
To fix this query, we need to join the sales and agents tables using two pairs of columns, corresponding to the last name and
the first name of the agent:
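A sketch of the two-column join, again with assumed sales columns and made-up sales rows:

```sql
WITH agents(id, first_name, last_name, experience_years) AS (
  VALUES (1, 'Kate',     'White',    5),
         (2, 'Melissa',  'Brown',    2),
         (3, 'Alexandr', 'McGregor', 3),
         (4, 'Sophia',   'Scott',    3),
         (5, 'Steven',   'Black',    1),
         (6, 'Maria',    'Scott',    1)
),
-- Hypothetical sales rows
sales(house_id, sale_date, agent_first_name, agent_last_name, customer_id) AS (
  VALUES (1015, '2021-10-05', 'Sophia',   'Scott',    1),
         (2134, '2021-10-12', 'Maria',    'Scott',    2),
         (3021, '2021-10-20', 'Alexandr', 'McGregor', 1),
         (3022, '2021-10-20', 'Alexandr', 'McGregor', 3)
)
SELECT s.house_id, a.first_name, a.last_name, a.experience_years
FROM sales s
JOIN agents a
  ON  s.agent_last_name  = a.last_name
  AND s.agent_first_name = a.first_name  -- both columns together identify the agent
ORDER BY s.house_id;
```

With this sample data, each house appears exactly once. If your sales table stores an agent ID instead of names, joining on that key would be the more robust choice.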
While JOIN is one of the basic tools in SQL, you need to be aware of the many different nuances to join tables effectively. I
recommend practicing SQL JOINs with this interactive course that covers a variety of joining scenarios with 93 coding challenges.
3. Selecting a Subset of Columns
In some cases, the records in the result set are not duplicates but appear as if they are because the selected subset of columns
doesn’t show all differences between records.
For example, imagine we want to see the dates each real estate agent sold a house. If we use the following query:
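The query could look roughly like this. The sale_date column and the sales rows are assumptions; the rows are chosen so that Alexandr McGregor sells two houses on the same day.

```sql
WITH agents(id, first_name, last_name, experience_years) AS (
  VALUES (1, 'Kate',     'White',    5),
         (2, 'Melissa',  'Brown',    2),
         (3, 'Alexandr', 'McGregor', 3),
         (4, 'Sophia',   'Scott',    3),
         (5, 'Steven',   'Black',    1),
         (6, 'Maria',    'Scott',    1)
),
-- Hypothetical sales rows
sales(house_id, sale_date, agent_first_name, agent_last_name, customer_id) AS (
  VALUES (1015, '2021-10-05', 'Sophia',   'Scott',    1),
         (2134, '2021-10-12', 'Maria',    'Scott',    2),
         (3021, '2021-10-20', 'Alexandr', 'McGregor', 1),
         (3022, '2021-10-20', 'Alexandr', 'McGregor', 3)
)
SELECT a.first_name, a.last_name, s.sale_date   -- house_id is not selected
FROM agents a
JOIN sales s
  ON  a.first_name = s.agent_first_name
  AND a.last_name  = s.agent_last_name;
```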
The result set includes two records with Alexandr McGregor that appear identical. However, if you add house ID to the SELECT
statement, you see these two records correspond to the sale of two different houses on the same day.
If you are not interested in this additional information and want to have only one row displayed here, use DISTINCT:
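A sketch of the DISTINCT version, using the same assumed columns and illustrative sales rows:

```sql
WITH agents(id, first_name, last_name, experience_years) AS (
  VALUES (1, 'Kate',     'White',    5),
         (2, 'Melissa',  'Brown',    2),
         (3, 'Alexandr', 'McGregor', 3),
         (4, 'Sophia',   'Scott',    3),
         (5, 'Steven',   'Black',    1),
         (6, 'Maria',    'Scott',    1)
),
-- Hypothetical sales rows
sales(house_id, sale_date, agent_first_name, agent_last_name, customer_id) AS (
  VALUES (1015, '2021-10-05', 'Sophia',   'Scott',    1),
         (2134, '2021-10-12', 'Maria',    'Scott',    2),
         (3021, '2021-10-20', 'Alexandr', 'McGregor', 1),
         (3022, '2021-10-20', 'Alexandr', 'McGregor', 3)
)
SELECT DISTINCT a.first_name, a.last_name, s.sale_date  -- collapse identical rows
FROM agents a
JOIN sales s
  ON  a.first_name = s.agent_first_name
  AND a.last_name  = s.agent_last_name;
```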
Now, the result is:
4. Listing Matching Rows Only
A similar problem may occur if you want to list only the rows from one table but there are several matching records in the other
table. You end up with unwanted duplicates in your result set.
For instance, say we want to list all customers who bought houses via our agency. If we use the following query:
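Such a query might be sketched as follows. The column names and sample rows are assumptions; the rows are set up so that Ivan Lee has two purchases.

```sql
-- Hypothetical sample data; replace the CTEs with your real tables.
WITH customers(id, first_name, last_name) AS (
  VALUES (1, 'Ivan', 'Lee'),
         (2, 'Anna', 'Novak'),
         (3, 'Omar', 'Hill')
),
sales(house_id, sale_date, agent_first_name, agent_last_name, customer_id) AS (
  VALUES (1015, '2021-10-05', 'Sophia',   'Scott',    1),
         (2134, '2021-10-12', 'Maria',    'Scott',    2),
         (3021, '2021-10-20', 'Alexandr', 'McGregor', 1),
         (3022, '2021-10-20', 'Alexandr', 'McGregor', 3)
)
SELECT c.id, c.first_name, c.last_name
FROM customers c
JOIN sales s
  ON c.id = s.customer_id;   -- one output row per sale, not per customer
```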
As you see, the resulting table includes Ivan Lee twice. This is because he bought two houses and there are two corresponding
records in the sales table. One possible solution is to use DISTINCT as in the previous example. An even better solution is to
avoid using SQL JOIN at all by filtering the result set using the EXISTS keyword:
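A sketch of the EXISTS approach under the same assumptions (hypothetical column names and sample rows):

```sql
-- Hypothetical sample data; replace the CTEs with your real tables.
WITH customers(id, first_name, last_name) AS (
  VALUES (1, 'Ivan', 'Lee'),
         (2, 'Anna', 'Novak'),
         (3, 'Omar', 'Hill')
),
sales(house_id, sale_date, agent_first_name, agent_last_name, customer_id) AS (
  VALUES (1015, '2021-10-05', 'Sophia',   'Scott',    1),
         (2134, '2021-10-12', 'Maria',    'Scott',    2),
         (3021, '2021-10-20', 'Alexandr', 'McGregor', 1),
         (3022, '2021-10-20', 'Alexandr', 'McGregor', 3)
)
SELECT c.id, c.first_name, c.last_name
FROM customers c
WHERE EXISTS (               -- true as soon as one matching sale is found
  SELECT 1
  FROM sales s
  WHERE s.customer_id = c.id
);
```

Because EXISTS only tests whether a matching sale exists, each customer is returned at most once, no matter how many houses they bought.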
This gives you the desired output and also makes the intention of your query clearer.
5. Using Self Joins
Let’s say we want our agents to form pairs for our next training. Obviously, we don’t want any agent to be paired with
themselves. So, we might specify the ON condition a1.id <> a2.id:
SELECT
a1.first_name as agent1_first_name,
a1.last_name as agent1_last_name,
a1.experience_years as agent1_experience,
a2.first_name as agent2_first_name,
a2.last_name as agent2_last_name,
a2.experience_years as agent2_experience
FROM agents a1
JOIN agents a2
ON a1.id <> a2.id
ORDER BY a1.id;
However, this query outputs each pair twice. For example, in the first row of the table below, Kate White is considered Agent 1, and
Maria Scott is considered Agent 2. But closer to the end of the table, you get the same pair of agents but with Maria Scott as Agent
1 and Kate White as Agent 2.
To solve this issue, you need to add an explicit condition to include each pair only once. One common solution is to specify the
joining condition a1.id < a2.id. With this, you get the pair Kate White and Maria Scott but not vice versa. This is because
Kate’s ID (1) is a lower number than Maria’s ID (6).
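Applied to the earlier self join, the condition could be used as in the sketch below. The agents rows are the ones listed at the start of the article; the CTE (PostgreSQL/SQLite-style VALUES syntax) only stands in for the real table.

```sql
WITH agents(id, first_name, last_name, experience_years) AS (
  VALUES (1, 'Kate',     'White',    5),
         (2, 'Melissa',  'Brown',    2),
         (3, 'Alexandr', 'McGregor', 3),
         (4, 'Sophia',   'Scott',    3),
         (5, 'Steven',   'Black',    1),
         (6, 'Maria',    'Scott',    1)
)
SELECT
  a1.first_name AS agent1_first_name,
  a1.last_name  AS agent1_last_name,
  a2.first_name AS agent2_first_name,
  a2.last_name  AS agent2_last_name
FROM agents a1
JOIN agents a2
  ON a1.id < a2.id          -- keeps (1, 6) but drops (6, 1) and (1, 1)
ORDER BY a1.id, a2.id;
```

For six agents, this yields 15 rows, one for each unordered pair.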
In practice, you may have some other conditions for pairing the agents. For instance, you may want to pair more experienced
agents (3+ years) with less experienced ones (< 3 years). The corresponding filtering condition in WHERE solves the problem:
SELECT
a1.first_name as agent1_first_name,
a1.last_name as agent1_last_name,
a1.experience_years as agent1_experience,
a2.first_name as agent2_first_name,
a2.last_name as agent2_last_name,
a2.experience_years as agent2_experience
FROM agents a1
JOIN agents a2
ON a1.id <> a2.id
WHERE a1.experience_years >= 3 AND a2.experience_years < 3
ORDER BY a1.id;
This result set looks much better and makes it easier to select three pairs, each consisting of an agent with more experience and
another with less experience.
If you have only basic experience with SQL and want to combine data more confidently from multiple tables, I recommend this
SQL JOINs interactive course. It covers all major types of JOINs as well as joining a table with itself, joining multiple tables in one
query, and joining tables on non-key columns. Get more details about this course in this overview article.
Bonus. Here are the top 10 SQL JOIN interview questions with answers.