Merging DataFrames is a common operation when working with multiple datasets in Pandas. The `merge()` function allows you to combine two DataFrames based on a common column or index. In this article, we will explore how to merge DataFrames using various options and techniques.
We will load the datasets into two Pandas DataFrames and merge them based on the ID column.
Python
import pandas as pd
# Create DataFrame for Dataset 1
data1 = [[1, 'Raj', 25],
[2, 'Sharma', 42],
[3, 'Saroj', 22],
[4, 'Raja', 51],
[5, 'Kajal', 26]]
df1 = pd.DataFrame(data1, columns=['ID', 'Name', 'Age'])
# Create DataFrame for Dataset 2
data2 = [[1, 'Male', 25000],
[2, 'Male', 42500],
[3, 'Female', 22300],
[4, 'Male', 51400],
[5, 'Female', 26800]]
df2 = pd.DataFrame(data2, columns=['ID', 'Gender', 'Salary'])
Method 1: Inner Merge
An inner merge returns only the rows that have matching values in both DataFrames. If a row doesnt have a corresponding match in either DataFrame, it is excluded.
- Call the merge() function with how=inner.
Python
# Perform an inner merge on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)
This merge includes only the rows where the ID exists in both DataFrames. In this case, the rows for ID = 1, 2, 4, and 5 will be included in the result.
Method 2: Left Merge
A left merge returns all the rows from the left DataFrame (first DataFrame) and the matching rows from the right DataFrame. If there is no match, the result will contain NaN values for columns from the right DataFrame.
- Call the merge() function with how=left.
Python
# Perform a left merge on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='left')
print(merged_df)
This method ensures all rows from the left DataFrame (df1) are kept, and columns from df2 are added where a matching ID is found. If no match is found, NaN is used for the columns from df2.
Method 3: Right Merge
A right merge is the opposite of a left merge. It returns all the rows from the right DataFrame (second DataFrame) and the matching rows from the left DataFrame.
- Call the merge() function with how=right.
Python
# Perform a right merge on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='right')
print(merged_df)
This merge ensures that all rows from the right DataFrame (df2) are included. If no match is found for a row in df1, the left DataFrame’s columns will be NaN.
Method 4: Outer Merge
An outer merge returns all rows from both DataFrames, with `NaN` for missing values where there is no match. This is useful when you want to keep all the data from both DataFrames.
- Call the merge() function with how=outer.
Python
# Perform an outer merge on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='outer')
print(merged_df)
This method includes all rows from both DataFrames. If a row in df1 doesnt have a match in df2, or vice versa, the corresponding columns will have NaN values.
Method 5: Merging on Multiple Columns
You can merge DataFrames based on multiple columns by passing a list of column names to the on parameter. This allows you to perform more complex merges when the matching criteria are based on multiple columns.
- Call the merge() function with a list of column names.
Python
# Example with additional columns
df1['Country'] = ['USA', 'Canada', 'USA', 'Canada', 'USA']
df2['Country'] = ['USA', 'Canada', 'USA', 'Canada', 'Mexico']
# Merge on 'ID' and 'Country'
merged_df = pd.merge(df1, df2, on=['ID', 'Country'], how='inner')
print(merged_df)
Here, the merge is based on both the ID and Country columns. Only rows that have matching values in both columns from both DataFrames will be included.
You can refer this article for more detailed explanation: Merge Join two dataframes on multiple columns in Pandas
Method 6: Merging with Different Column Names
If the columns you want to merge on have different names in the two DataFrames, you can use the left_on and right_on parameters to specify the columns to merge on.
- Use left_on and right_on to specify the merge columns.
Python
# Rename 'ID' in df2 to 'EmployeeID' for demonstration
df2.rename(columns={'ID': 'EmployeeID'}, inplace=True)
# Merge using different column names
merged_df = pd.merge(df1, df2, left_on='ID', right_on='EmployeeID', how='inner')
print(merged_df)
Here, we merged the DataFrames using different column names: ID in df1 and EmployeeID in df2. The left_on and right_on parameters allow you to specify which columns to use for the merge.
You can refer this article for more detailed explanation: Merge two dataframes with different columns
Summary:
- Inner merge: Includes only the matching rows from both DataFrames.
- Left merge: Keeps all rows from the left DataFrame, adding matching rows from the right.
- Right merge: Keeps all rows from the right DataFrame, adding matching rows from the left.
- Outer merge: Includes all rows from both DataFrames, with NaN where there are no matches.
- Merge on multiple columns: Combine DataFrames based on multiple criteria.
- Merge with different column names: Specify different columns to merge using left_on and right_on.
Recommendation: For most scenarios, the inner merge is the most commonly used, but depending on your data and the relationship between the two DataFrames, other merge types may be more appropriate. Experiment with these options to find the best fit for your data merging needs.
Similar Reads
Pandas DataFrame
Pandas DataFrame is two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. Pandas DataFrame consists of three principal comp
11 min read
Pandas Dataframe Difference
When working with multiple DataFrames, you might want to compute the differences between them, such as identifying rows that are in one DataFrame but not in another. Pandas provides various ways to compute the difference between DataFrames, whether it's comparing rows, columns, or entire DataFrames.
4 min read
Pandas Dataframe Index
Index in pandas dataframe act as reference for each row in dataset. It can be numeric or based on specific column values. The default index is usually a RangeIndex starting from 0, but you can customize it for better data understanding. You can easily access the current index of a dataframe using th
3 min read
Merge Multiple Dataframes - Pandas
Merging allow us to combine data from two or more DataFrames into one based on index values. This is used when we want to bring together related information from different sources. In Pandas there are different ways to combine DataFrames:1. Merging DataFrames Using merge()We use merge() when we want
3 min read
Pandas Join Dataframes
Joining DataFrames is a common operation in data analysis, where you combine two or more DataFrames based on common columns or indices. Pandas provides various methods to perform joins, allowing you to merge data in flexible ways. In this article, we will explore how to join DataFrames using methods
4 min read
Python | Pandas dataframe.melt()
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric python packages. Pandas is one of those packages and makes importing and analyzing data much easier. Pandas dataframe.melt() function unpivots a DataFrame from wide format to long format,
2 min read
Python | Pandas dataframe.mul()
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric python packages. Pandas is one of those packages and makes importing and analyzing data much easier.Pandas dataframe.mul() function return multiplication of dataframe and other element- w
2 min read
Python | Pandas dataframe.mask()
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric python packages. Pandas is one of those packages and makes importing and analyzing data much easier. Pandas dataframe.mask() function return an object of same shape as self and whose corr
3 min read
Pandas DataFrame.reset_index()
In Pandas, reset_index() method is used to reset the index of a DataFrame. By default, it creates a new integer-based index starting from 0, making the DataFrame easier to work with in various scenarios, especially after performing operations like filtering, grouping or multi-level indexing. Example
3 min read
Filter Pandas DataFrame by Time
In this article let's see how to filter pandas data frame by date. So we can filter python pandas data frame by date using the logical operator and loc() method. In the below examples we have a data frame that contains two columns the first column is Name and another one is DOB. Example 1: filter da
1 min read