Pandas Merge Dataframe

Merging DataFrames is a common operation when working with multiple datasets in Pandas. The `merge()` function allows you to combine two DataFrames based on a common column or index. In this article, we will explore how to merge DataFrames using various options and techniques.

We will load the datasets into two Pandas DataFrames and merge them based on the ID column.

Python

import pandas as pd

# Create DataFrame for Dataset 1
data1 = [[1, 'Raj', 25], 
         [2, 'Sharma', 42], 
         [3, 'Saroj', 22], 
         [4, 'Raja', 51], 
         [5, 'Kajal', 26]]
df1 = pd.DataFrame(data1, columns=['ID', 'Name', 'Age'])

# Create DataFrame for Dataset 2
data2 = [[1, 'Male', 25000], 
         [2, 'Male', 42500], 
         [3, 'Female', 22300], 
         [4, 'Male', 51400], 
         [5, 'Female', 26800]]
df2 = pd.DataFrame(data2, columns=['ID', 'Gender', 'Salary'])

Method 1: Inner Merge

An inner merge returns only the rows that have matching values in both DataFrames. If a row doesnt have a corresponding match in either DataFrame, it is excluded.

Call the merge() function with how=inner.

Python

# Perform an inner merge on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)

This merge includes only the rows where the ID exists in both DataFrames. In this case, the rows for ID = 1, 2, 4, and 5 will be included in the result.

Method 2: Left Merge

A left merge returns all the rows from the left DataFrame (first DataFrame) and the matching rows from the right DataFrame. If there is no match, the result will contain NaN values for columns from the right DataFrame.

Call the merge() function with how=left.

Python

# Perform a left merge on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='left')
print(merged_df)

This method ensures all rows from the left DataFrame (df1) are kept, and columns from df2 are added where a matching ID is found. If no match is found, NaN is used for the columns from df2.

Method 3: Right Merge

A right merge is the opposite of a left merge. It returns all the rows from the right DataFrame (second DataFrame) and the matching rows from the left DataFrame.

Call the merge() function with how=right.

Python

# Perform a right merge on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='right')
print(merged_df)

This merge ensures that all rows from the right DataFrame (df2) are included. If no match is found for a row in df1, the left DataFrame’s columns will be NaN.

Method 4: Outer Merge

An outer merge returns all rows from both DataFrames, with `NaN` for missing values where there is no match. This is useful when you want to keep all the data from both DataFrames.

Call the merge() function with how=outer.

Python

# Perform an outer merge on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='outer')
print(merged_df)

This method includes all rows from both DataFrames. If a row in df1 doesnt have a match in df2, or vice versa, the corresponding columns will have NaN values.

Method 5: Merging on Multiple Columns

You can merge DataFrames based on multiple columns by passing a list of column names to the on parameter. This allows you to perform more complex merges when the matching criteria are based on multiple columns.

Call the merge() function with a list of column names.

Python

# Example with additional columns
df1['Country'] = ['USA', 'Canada', 'USA', 'Canada', 'USA']
df2['Country'] = ['USA', 'Canada', 'USA', 'Canada', 'Mexico']

# Merge on 'ID' and 'Country'
merged_df = pd.merge(df1, df2, on=['ID', 'Country'], how='inner')
print(merged_df)

Here, the merge is based on both the ID and Country columns. Only rows that have matching values in both columns from both DataFrames will be included.

You can refer this article for more detailed explanation: Merge Join two dataframes on multiple columns in Pandas

Method 6: Merging with Different Column Names

If the columns you want to merge on have different names in the two DataFrames, you can use the left_on and right_on parameters to specify the columns to merge on.

Use left_on and right_on to specify the merge columns.

Python

# Rename 'ID' in df2 to 'EmployeeID' for demonstration
df2.rename(columns={'ID': 'EmployeeID'}, inplace=True)

# Merge using different column names
merged_df = pd.merge(df1, df2, left_on='ID', right_on='EmployeeID', how='inner')
print(merged_df)

Here, we merged the DataFrames using different column names: ID in df1 and EmployeeID in df2. The left_on and right_on parameters allow you to specify which columns to use for the merge.

You can refer this article for more detailed explanation: Merge two dataframes with different columns

Summary:

Inner merge: Includes only the matching rows from both DataFrames.
Left merge: Keeps all rows from the left DataFrame, adding matching rows from the right.
Right merge: Keeps all rows from the right DataFrame, adding matching rows from the left.
Outer merge: Includes all rows from both DataFrames, with NaN where there are no matches.
Merge on multiple columns: Combine DataFrames based on multiple criteria.
Merge with different column names: Specify different columns to merge using left_on and right_on.

Recommendation: For most scenarios, the inner merge is the most commonly used, but depending on your data and the relationship between the two DataFrames, other merge types may be more appropriate. Experiment with these options to find the best fit for your data merging needs.

Method 1: Inner Merge

Method 2: Left Merge

Method 3: Right Merge

Method 4: Outer Merge

Method 5: Merging on Multiple Columns

Method 6: Merging with Different Column Names

Summary:

Explore