Join algorithms in Database
Last Updated :
23 Apr, 2025
A well-designed database aims to reduce unnecessary repetition of data. This is achieved through normalization which breaks data into smaller, more manageable tables. However, when we need to combine information from different tables, we use joins.
One common type of join is the inner equijoin. This join combines rows from two tables where the values of specific columns (usually keys) match.
How Joins Work
When you perform a join, it takes two pieces of data (called tuples) from two different tables and combines them into a new piece of data. The key part of the join is matching values in both tables based on a common attribute, like a unique ID.
Basic Concept: A join combines rows (tuples) from two or more tables based on a related column. The related column is typically a foreign key in one table that matches the primary key in another table.
Matching Values: The join operation looks for matching values in the columns specified for the join condition. For example, if you’re joining an Employees table and a Departments table, the join matches department_id from Employees with id from Departments.
Result of Join: When a match is found, SQL combines the matching rows into a single new row in the result set. The new row contains data from both tables for the matched rows.
Types of Output from Joins
The way the database combines the data can differ depending on how it’s set up. There are two main ways this happens:
1. Data Copying: When performing a join operation, the database copies the values from the matching attributes of both the outer and inner tuples into a new output tuple. This means that the data from both tables involved in the join are combined and placed into a result.
Once this data is combined into the output tuple, future operations in the query plan can continue processing the result without having to go back to the original base tables for more data. This approach is useful because it minimizes the need to re-access the original tables, which can be slow and costly in terms of performance. Instead, the combined data is readily available in the intermediate result, speeding up the query execution.

Data Copying
2. Record IDs: In some cases, instead of copying all the data from the matching rows, the database only copies the join keys (the attributes used to match the rows) and the record IDs of the matching tuples. This approach is more efficient because it avoids copying unnecessary data.
This method is especially useful for column stores because the database only stores the data that is needed for the query, reducing memory usage.
This approach is called late materialization. It means that the database delays the actual copying of data until it’s absolutely necessary, helping to improve performance and reduce the amount of data stored in memory.

Record IDs
In real-world database systems, the approach for performing joins can vary. DBMS often combine two methods (copying data and copying record IDs) based on factors such as the number of tuples, columns, and the types of operations in the query. The choice of strategy is dynamic, aiming to optimize performance based on the query’s specific needs.
I/O Cost Analysis
When analyzing different join algorithms, we focus on I/O costs that is the number of disk read and write operations needed to compute the join. This is because disk I/O operations tend to dominate the cost especially in disk-based systems so we typically ignore the computation costs for join algorithms.
The I/O cost only considers the actions taken during the join computation. The cost of outputting the result is not considered because that depends on the size of the output data, which is the same regardless of the join algorithm used.
Variables:
- M: The number of pages in table R (the outer table).
- N: The number of pages in table S (the inner table).
- m: The total number of tuples in table R.
- n: The total number of tuples in table S
Consider a query:
SELECT R.id, S.cdate
FROM R JOIN S
ON R.id = S.id
WHERE S.value > 100
This query performs a join between two tables, R and S, using the id attribute. The query also filters rows from table S where the value is greater than 100.
In the case of joins, the natural join (R ⨝ S
) is one of the most common operations. However, some algorithms can be inefficient if not optimized well. For example, a cross product (R × S
) might first be computed, and then relevant rows are selected. This approach is inefficient because the cross product can be extremely large, leading to high computation costs.
Join Algorithms
At a high level, a join algorithm typically uses two loops to compare each row (or tuple) from two tables. One table is called the outer table, and the other is the inner table. The outer table’s tuples are compared with the inner table’s tuples and if they match based on the join condition, the matching rows are combined and output.
For each tuple tR in table TR do
For each tuple ts in table TS do
Compare (tR, ts) if they satisfy the join condition
If they match, add the result to the join output
End
End
The outer table is usually the smaller table, in terms of the number of rows or pages, because the DBMS can load it into memory to speed up the comparison process. Using indexes (special look-up structures) when available can also make the process faster.
1. Naive Nested Loop Join
- In this simplest form of nested loop join, The DBMS compares each tuple from the outer table with every tuple in the inner table. This approach is the worst-case scenario because the DBMS has to scan the entire inner table for each tuple in the outer table. No caching or memory optimization is used.
- Total Cost: The cost in terms of I/O operations depends on how many pages need to be read. For each tuple in the outer table (
m
tuples), the DBMS has to scan all the tuples in the inner table (n
tuples).
- Outer Table (M pages): The outer table has M pages, which means it contains m tuples (rows).
- Inner Table (N pages): The inner table has N pages, which means it contains n tuples (rows).
- Cost Measurement: Since each page can hold multiple tuples, the cost of the join operation is typically measured in terms of page accesses.
- Cost Formula: The formula for the estimated cost of a Naive Nested Loop Join is:
M + (m × N)
- Where:
- M: The number of pages in the outer table.
- m × N: Refers to the tuple comparisons made between the outer and inner tables. For each tuple from the outer table (m tuples), the DBMS must compare it with all the tuples in the inner table (n tuples). Hence, the total number of comparisons is m × N.
2. Block Nested Loop Join
- The block nested loop join improves on the naive approach by scanning tables in blocks (groups of tuples). The DBMS reads a block from the outer table and compares it with a block from the inner table. This reduces the number of disk accesses because it processes multiple tuples at once, instead of checking each tuple individually.
- Cost: If there are M pages in the outer table, and the number of blocks in R (outer table) is limited by the available memory (B buffers), the cost of this operation is:
M + ((number of blocks in R) × N)
- The block size is determined by the memory available for buffers. If there are B buffers, the DBMS uses B – 2 buffers to scan the outer table. One buffer is used for the inner table, and one is used for storing the join result.
3. Index Nested Loop Join
- The index nested loop join uses an index on the inner table to quickly find matching rows instead of scanning the entire table. While the outer table is still scanned row by row, the DBMS uses the index on the inner table to speed up the process of finding matches. If the inner table has an index on the join key, it greatly improves the join performance.
- Cost: The cost for this join is:
M + (m × C)
- Here, M is the number of pages in the outer table, m is the number of tuples in the outer table, and C is the constant cost of each index lookup.
Read more about Nested Loop Join in DBMS
Related GATE questions:
Similar Reads
Join Dependency in Database
Join Dependency means re-creating the original Table by joining multiple sub-tables of the given Table. It is a further generalization of MVD(multi-valued Dependencies). When a relation R can be obtained by joining the R1, R2, R3..., Rn where R1, R2, R3..., Rn are sub-relations of R, it is called a
5 min read
Joins in DBMS
A join is an operation that combines the rows of two or more tables based on related columns. This operation is used for retrieving the data from multiple tables simultaneously using common columns of tables. In this article, we are going to discuss every point about joins. What is Join?Join is an o
7 min read
Mobile Database
In this article, we will discuss the overview of the mobile databases and will emphasize its features of the mobile database, and then will cover involves parties, and it's limitations. Let's discuss it one by one. Overview :A Mobile database is a database that can be connected to a mobile computing
2 min read
Interesting Facts about DBMS
The amount of information we are surrounded with is literally exploding every single day and there is an immediate need to organise all these data. Database Management System (DBMS) extract information from millions of facts or data stored in a database. As the need for maintenance increased the dem
3 min read
Denormalization in Databases
Denormalization focuses on combining multiple tables to make queries execute quickly. It adds redundancies in the database though. In this article, weâll explore Denormalization and how it impacts database design. This method can help us to avoid costly joins in a relational database made during nor
6 min read
What is Database?
In todayâs data-driven world, databases are indispensable for managing, storing, and retrieving information efficiently. From small-scale businesses to global enterprises, databases serve as the backbone of operations, powering applications, websites, and analytics systems. In this comprehensive art
15 min read
Logical Database
A Logical Database is a special type of ABAP (Advance Business Application and Programming) that is used to retrieve data from various tables and the data is interrelated to each other. Also, a logical database provides a read-only view of Data. Structure Of Logical Database:A Logical database uses
4 min read
Join Dependencies in DBMS
Join Dependency (JD) can be illustrated as when the relation R is equal to the join of the sub-relations R1, R2,..., and Rn are present in the database. Join Dependency arises when the attributes in one relation are dependent on attributes in another relation, which means certain rows will exist in
5 min read
Types of Relationship in Database
A relationship in a DBMS exists when a variable has a connection with the properties stored in different tables. Such relationships help the organization of entities intertwined with each other, ultimately enabling efficient data processing. They're exhibited usually via keys in a table, which is ei
4 min read
Merge Join in DBMS
Merge be part of is a hard and fast-based be part of operation used in database control systems (DBMS) to mix rows from or extra tables based on an associated column among them. It is mainly efficient whilst the tables involved are large and while they are each sorted on the be a part of the key, wh
7 min read