Joining DataFrames is a common operation in data analysis, where you combine two or more DataFrames based on common columns or indices. Pandas provides various methods to perform joins, allowing you to merge data in flexible ways. In this article, we will explore how to join DataFrames using methods like merge(), join(), and concat() in Pandas.
Python
import pandas as pd
# First dataset: personal details
data1 = {'Name': ['John', 'Alice', 'Bob', 'Eve'],
         'Age': [25, 30, 22, 35],
         'Gender': ['Male', 'Female', 'Male', 'Female']}
df1 = pd.DataFrame(data1)
print(df1)
# Second dataset: salaries (Charlie is not in df1, and Eve is not in df2)
data2 = {'Name': ['John', 'Alice', 'Bob', 'Charlie'],
         'Salary': [50000, 55000, 40000, 48000]}
df2 = pd.DataFrame(data2)
print(df2)
We will use these datasets to demonstrate how to join DataFrames in various ways.
Joining DataFrames Using merge
The merge() function is used to combine DataFrames based on common columns or indices. It is the most flexible way to join DataFrames, offering different types of joins (inner, left, right, and outer) similar to SQL joins.
- Use merge() to join the DataFrames based on a common column.
Python
# Merge df1 and df2 on the 'Name' column
merged_df = pd.merge(df1, df2, on='Name', how='inner')
print(merged_df)
on='Name' specifies that the DataFrames are merged on the Name column. how='inner' performs an inner join, which keeps only the rows whose Name value appears in both DataFrames. Here that means John, Alice, and Bob; Eve and Charlie are dropped.
Performing a Left Join Using merge
A left join returns all the rows from the left DataFrame (df1) and the matching rows from the right DataFrame (df2). If no match is found, NaN values are filled for columns from the right DataFrame.
- Use merge() with how='left' to perform a left join.
Python
# Perform a left join on 'Name'
left_joined_df = pd.merge(df1, df2, on='Name', how='left')
print(left_joined_df)
how='left' keeps every row from the left DataFrame (df1) and adds the columns of the right DataFrame (df2) wherever the Name matches. If there is no match in df2, the Salary column holds NaN for that row. Here, Eve has no entry in df2, so her Salary is NaN.
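As a minimal sketch of how to spot the unmatched rows after a left join, the snippet below rebuilds the two example DataFrames and filters on the NaN values in Salary. The variable name unmatched is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Eve'],
                    'Age': [25, 30, 22, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})
df2 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Charlie'],
                    'Salary': [50000, 55000, 40000, 48000]})

left_joined_df = pd.merge(df1, df2, on='Name', how='left')

# Rows of df1 that found no partner in df2 have NaN in the Salary column
unmatched = left_joined_df[left_joined_df['Salary'].isna()]
print(unmatched['Name'].tolist())  # ['Eve']
```

Because NaN is a floating-point value, the Salary column becomes float64 after the join even though it held integers in df2.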
Performing a Right Join Using merge
A right join returns all rows from the right DataFrame (df2) and the matching rows from the left DataFrame (df1). If no match is found, NaN values are filled for columns from the left DataFrame.
- Use merge() with how='right' to perform a right join.
Python
# Perform a right join on 'Name'
right_joined_df = pd.merge(df1, df2, on='Name', how='right')
print(right_joined_df)
how='right' keeps every row from the right DataFrame (df2) and adds the columns of the left DataFrame (df1) wherever the Name matches. If there is no match in df1, the columns from df1 hold NaN. Here, Charlie has no entry in df1, so his Age and Gender are NaN.
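A right join can also be read as a left join with the two DataFrames swapped. The sketch below, using the same example data, checks that both calls keep the same names in the same order; only the column order of the results differs. The name swapped_left is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Eve'],
                    'Age': [25, 30, 22, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})
df2 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Charlie'],
                    'Salary': [50000, 55000, 40000, 48000]})

# Right join: keep every row of df2
right_joined_df = pd.merge(df1, df2, on='Name', how='right')

# Left join with the arguments swapped: also keeps every row of df2
swapped_left = pd.merge(df2, df1, on='Name', how='left')

print(right_joined_df['Name'].tolist())  # ['John', 'Alice', 'Bob', 'Charlie']
print(swapped_left['Name'].tolist())     # ['John', 'Alice', 'Bob', 'Charlie']
```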
Performing an Outer Join Using merge
An outer join returns all rows from both DataFrames. If a row in one DataFrame has no match in the other, the columns that come from the other DataFrame are filled with NaN for that row.
- Use merge() with how='outer' to perform an outer join.
Python
# Perform an outer join on 'Name'
outer_joined_df = pd.merge(df1, df2, on='Name', how='outer')
print(outer_joined_df)
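To see where each row of an outer join came from, merge() accepts indicator=True, which adds a _merge column with the values both, left_only, or right_only. The following is a small sketch on the same example data; the variable names outer and source are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Eve'],
                    'Age': [25, 30, 22, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})
df2 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Charlie'],
                    'Salary': [50000, 55000, 40000, 48000]})

# indicator=True adds a '_merge' column describing the origin of each row
outer = pd.merge(df1, df2, on='Name', how='outer', indicator=True)

# Map each name to its origin for a compact overview
source = outer.set_index('Name')['_merge'].astype(str).to_dict()
print(source)
```

In this data, Eve is reported as left_only, Charlie as right_only, and the other three names as both.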
Joining DataFrames Using join
The join() method is another way to combine DataFrames. By default it matches rows on the index of both DataFrames rather than on columns, although its on parameter lets the calling DataFrame use one of its columns instead. It is often used when you have a DataFrame with a meaningful index and want to join another DataFrame based on that index.
- Use join() to join DataFrames based on the index.
Python
# Set 'Name' as the index for both DataFrames
df1.set_index('Name', inplace=True)
df2.set_index('Name', inplace=True)
# Join df1 with df2 on the index
joined_df = df1.join(df2)
print(joined_df)
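The join() method merges DataFrames using their indexes. By setting the Name column as the index, we can join the DataFrames based on the index values. Unlike merge(), which defaults to an inner join, join() performs a left join by default, so the result keeps every row of df1: Eve appears with NaN in Salary, and Charlie is not included.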
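The how parameter of join() accepts the same join types as merge(). The sketch below compares the default left join with an inner join on the example data. It uses set_index() without inplace=True so the original DataFrames stay unchanged; the names left and right are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Eve'],
                    'Age': [25, 30, 22, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})
df2 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Charlie'],
                    'Salary': [50000, 55000, 40000, 48000]})

# Indexed copies; df1 and df2 themselves are not modified
left = df1.set_index('Name')
right = df2.set_index('Name')

default_join = left.join(right)            # left join (the default)
inner_join = left.join(right, how='inner')  # only names present in both

print(default_join.index.tolist())  # ['John', 'Alice', 'Bob', 'Eve']
print(inner_join.index.tolist())    # ['John', 'Alice', 'Bob']
```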
Concatenating DataFrames Using concat
The concat() function allows you to concatenate DataFrames either vertically (along rows) or horizontally (along columns). Unlike a SQL-style join, it does not match rows on key values in a column; it stacks the DataFrames along the chosen axis and aligns them on the labels of the other axis.
Python
# Concatenate df1 and df2 along rows (vertical concatenation)
concatenated_df = pd.concat([df1, df2], axis=0)
print(concatenated_df)
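The concat() function concatenates DataFrames along a particular axis. Setting axis=0 combines them along rows (vertical concatenation), while axis=1 would concatenate along columns (horizontal concatenation). In the vertical result above, the rows of df2 are placed below the rows of df1, so names such as John appear twice, and any column that exists in only one DataFrame is filled with NaN for the rows that come from the other.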
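For comparison, here is a short sketch of both directions on the example data. ignore_index=True is a concat() option that renumbers the rows of the stacked result; the names stacked and side_by_side are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Eve'],
                    'Age': [25, 30, 22, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})
df2 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Charlie'],
                    'Salary': [50000, 55000, 40000, 48000]})

# Vertical: 4 + 4 rows, union of the columns, fresh 0..7 row labels
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)
print(stacked.shape)  # (8, 4)

# Horizontal: columns are placed side by side and rows are aligned on the index,
# so both frames are indexed by Name first
side_by_side = pd.concat([df1.set_index('Name'), df2.set_index('Name')], axis=1)
print(side_by_side.shape)  # (5, 3)
```

With axis=1, concat() keeps the union of the index labels by default, so the horizontal result resembles an outer join on Name.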
Summary:
Joining DataFrames is an essential operation in data analysis. Pandas provides flexible methods for combining DataFrames, including:
- merge(): Allows you to perform SQL-like joins (inner, left, right, outer).
- join(): Joins DataFrames based on their indexes.
- concat(): Concatenates DataFrames along rows or columns.
By understanding and using these methods, you can efficiently combine data from multiple sources to perform more complex analyses.