Merging DataFrames is a common operation when working with multiple datasets in Pandas. The `merge()` function allows you to combine two DataFrames based on a common column or index. In this article, we will explore how to merge DataFrames using various options and techniques.
We will load the datasets into two Pandas DataFrames and merge them based on the ID column.
Python
import pandas as pd
# Create DataFrame for Dataset 1
data1 = [[1, 'Raj', 25],
[2, 'Sharma', 42],
[3, 'Saroj', 22],
[4, 'Raja', 51],
[5, 'Kajal', 26]]
df1 = pd.DataFrame(data1, columns=['ID', 'Name', 'Age'])
# Create DataFrame for Dataset 2
data2 = [[1, 'Male', 25000],
[2, 'Male', 42500],
[3, 'Female', 22300],
[4, 'Male', 51400],
[5, 'Female', 26800]]
df2 = pd.DataFrame(data2, columns=['ID', 'Gender', 'Salary'])
Method 1: Inner Merge
An inner merge returns only the rows that have matching values in both DataFrames. If a row doesnt have a corresponding match in either DataFrame, it is excluded.
- Call the merge() function with how=inner.
Python
# Perform an inner merge on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)
This merge includes only the rows where the ID exists in both DataFrames. In this case, the rows for ID = 1, 2, 4, and 5 will be included in the result.
Method 2: Left Merge
A left merge returns all the rows from the left DataFrame (first DataFrame) and the matching rows from the right DataFrame. If there is no match, the result will contain NaN values for columns from the right DataFrame.
- Call the merge() function with how=left.
Python
# Perform a left merge on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='left')
print(merged_df)
This method ensures all rows from the left DataFrame (df1) are kept, and columns from df2 are added where a matching ID is found. If no match is found, NaN is used for the columns from df2.
Method 3: Right Merge
A right merge is the opposite of a left merge. It returns all the rows from the right DataFrame (second DataFrame) and the matching rows from the left DataFrame.
- Call the merge() function with how=right.
Python
# Perform a right merge on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='right')
print(merged_df)
This merge ensures that all rows from the right DataFrame (df2) are included. If no match is found for a row in df1, the left DataFrame’s columns will be NaN.
Method 4: Outer Merge
An outer merge returns all rows from both DataFrames, with `NaN` for missing values where there is no match. This is useful when you want to keep all the data from both DataFrames.
- Call the merge() function with how=outer.
Python
# Perform an outer merge on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='outer')
print(merged_df)
This method includes all rows from both DataFrames. If a row in df1 doesnt have a match in df2, or vice versa, the corresponding columns will have NaN values.
Method 5: Merging on Multiple Columns
You can merge DataFrames based on multiple columns by passing a list of column names to the on parameter. This allows you to perform more complex merges when the matching criteria are based on multiple columns.
- Call the merge() function with a list of column names.
Python
# Example with additional columns
df1['Country'] = ['USA', 'Canada', 'USA', 'Canada', 'USA']
df2['Country'] = ['USA', 'Canada', 'USA', 'Canada', 'Mexico']
# Merge on 'ID' and 'Country'
merged_df = pd.merge(df1, df2, on=['ID', 'Country'], how='inner')
print(merged_df)
Here, the merge is based on both the ID and Country columns. Only rows that have matching values in both columns from both DataFrames will be included.
You can refer this article for more detailed explanation: Merge Join two dataframes on multiple columns in Pandas
Method 6: Merging with Different Column Names
If the columns you want to merge on have different names in the two DataFrames, you can use the left_on and right_on parameters to specify the columns to merge on.
- Use left_on and right_on to specify the merge columns.
Python
# Rename 'ID' in df2 to 'EmployeeID' for demonstration
df2.rename(columns={'ID': 'EmployeeID'}, inplace=True)
# Merge using different column names
merged_df = pd.merge(df1, df2, left_on='ID', right_on='EmployeeID', how='inner')
print(merged_df)
Here, we merged the DataFrames using different column names: ID in df1 and EmployeeID in df2. The left_on and right_on parameters allow you to specify which columns to use for the merge.
You can refer this article for more detailed explanation: Merge two dataframes with different columns
Summary:
- Inner merge: Includes only the matching rows from both DataFrames.
- Left merge: Keeps all rows from the left DataFrame, adding matching rows from the right.
- Right merge: Keeps all rows from the right DataFrame, adding matching rows from the left.
- Outer merge: Includes all rows from both DataFrames, with NaN where there are no matches.
- Merge on multiple columns: Combine DataFrames based on multiple criteria.
- Merge with different column names: Specify different columns to merge using left_on and right_on.
Recommendation: For most scenarios, the inner merge is the most commonly used, but depending on your data and the relationship between the two DataFrames, other merge types may be more appropriate. Experiment with these options to find the best fit for your data merging needs.
Similar Reads
Pandas DataFrame A Pandas DataFrame is a two-dimensional table-like structure in Python where data is arranged in rows and columns. Itâs one of the most commonly used tools for handling data and makes it easy to organize, analyze and manipulate data. It can store different types of data such as numbers, text and dat
10 min read
Pandas Dataframe Difference When working with multiple DataFrames, you might want to compute the differences between them, such as identifying rows that are in one DataFrame but not in another. Pandas provides various ways to compute the difference between DataFrames, whether it's comparing rows, columns, or entire DataFrames.
4 min read
Merge Multiple Dataframes - Pandas Merging allow us to combine data from two or more DataFrames into one based on index values. This is used when we want to bring together related information from different sources. In Pandas there are different ways to combine DataFrames:1. Merging DataFrames Using merge()We use merge() when we want
3 min read
Pandas Join Dataframes Joining DataFrames is a common operation in data analysis, where you combine two or more DataFrames based on common columns or indices. Pandas provides various methods to perform joins, allowing you to merge data in flexible ways. In this article, we will explore how to join DataFrames using methods
4 min read
Joining two Pandas DataFrames using merge() The merge() function is designed to merge two DataFrames based on one or more columns with matching values. The basic idea is to identify columns that contain common data between the DataFrames and use them to align rows. Let's understand the process of joining two pandas DataFrames using merge(), e
4 min read