Pandas Joins
Pandas is a widely used Python library for data analysis and manipulation, and most of that
work happens in data frames. When your data is spread across several data frames, you can
combine them using three different functions, merge(), join(), and concat(), in four different ways.
These functions sound rather similar, so what is the difference between the three approaches?
merge() is the most flexible: it lets you specify the columns you want to join on and the type
of join. join() combines data frames on the index by default, while concat() is less structured
and stacks data frames along an axis instead of matching rows on a key.
In our examples, we will use merge() to show you how the different types of joins work in Python.
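To make the difference concrete, here is a minimal sketch on two small, made-up data frames (the names left, right, key, a, and b are hypothetical and not part of the article's datasets). It only illustrates the default behavior of each function.

```python
import pandas as pd

# Hypothetical toy data, not taken from the article's datasets.
left = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"key": [2, 3, 4], "b": [20, 30, 40]})

# merge(): join on a named column with an explicit join type.
merged = pd.merge(left, right, on="key", how="inner")

# join(): aligns on the index by default, so the key is moved to the index first.
joined = left.set_index("key").join(right.set_index("key"), how="inner")

# concat(): stacks the frames along an axis without matching rows on a key.
stacked = pd.concat([left, right], ignore_index=True)

print(merged)
print(joined)
print(stacked)
```

merge() and join() both keep only the keys found in both frames here, while concat() simply returns all six rows and fills the columns a frame lacks with NaN.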
In simple terms, pandas joins are used to combine two data frames. When doing that, you have
to specify the type of join, which determines which rows from each data frame end up in the
result.
Now, let’s look at the types of pandas joins available through the merge() function.
Types of Pandas Joins in Python
There are four main types of pandas joins in Python, which we will explain in this article.
● Inner
● Outer
● Left
● Right
Here is the official page of the Python Pandas “merge” function, which we will use to join two
data frames.
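As a quick orientation, the four types can be compared through the how parameter of merge(). The sketch below uses two hypothetical data frames (employees and survey, with made-up values) and only counts the rows each join type returns.

```python
import pandas as pd

# Hypothetical data: ids 2 and 3 appear in both frames, 1 and 4 in only one.
employees = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
survey = pd.DataFrame({"employee_id": [2, 3, 4], "score": [7, 9, 5]})

row_counts = {}
for how in ["inner", "outer", "left", "right"]:
    result = pd.merge(
        employees, survey, left_on="id", right_on="employee_id", how=how
    )
    row_counts[how] = len(result)

print(row_counts)
```

With this data, the inner join keeps only the two matching ids, the outer join keeps all four, and the left and right joins each keep the three rows of their own side.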
Inner Join in Python
Meta developed a new programming language called Hack. To measure its popularity, they ran
a survey among their employees. Due to an error, location data was not collected, but your
supervisor demands a report showing the average popularity of Hack by office location.
The aim is to find the average popularity of Hack per office location.
The output has to contain the location along with the average popularity.
Data
We have two data frames. Our first data frame is facebook_employees. The table has the
following columns.
The data preview is shown below.
The second data frame is facebook_hack_survey, and it has the following columns.
Coding
1. Let’s import the NumPy and Pandas libraries first to manipulate the data and use the
statistical methods with it.
import pandas as pd
import numpy as np
If you want to know how to import pandas as pd in python and its importance for doing data
science, check out our article “How to Import Pandas as pd in Python”.
2. We want to find the popularity of Hack per office location. Each location has to be matched
with its popularity scores, so we need the intersection of the two data frames and will use an
inner join.
Selecting the right join type is crucial to getting the correct answer. In this case, the left and
inner joins return the same result: 14 rows, which are the rows common to both tables.
The right join, however, returns the whole right data frame, which contains 17 rows. For the
rows without a match, the columns coming from the left data frame are filled with NA.
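The same effect can be reproduced on a small scale. The sketch below is an illustration on hypothetical data (three employees and five survey rows instead of the 14 and 17 rows mentioned above). It uses the indicator parameter of merge() to flag the rows that exist only in the right data frame.

```python
import pandas as pd

# Hypothetical data: every employee answered, but two survey rows
# (employee_id 8 and 9) have no matching employee.
employees = pd.DataFrame({"id": [1, 2, 3], "location": ["US", "UK", "US"]})
survey = pd.DataFrame(
    {"employee_id": [1, 2, 3, 8, 9], "popularity": [5, 7, 9, 4, 6]}
)

keys = dict(left_on="id", right_on="employee_id")
inner = pd.merge(employees, survey, how="inner", **keys)
left = pd.merge(employees, survey, how="left", **keys)
right = pd.merge(employees, survey, how="right", indicator=True, **keys)

print(len(inner), len(left), len(right))

# Rows found only in the right frame have missing values in the left columns.
right_only = right[right["_merge"] == "right_only"]
print(right_only[["id", "location", "employee_id"]])
```

Because every row of the left frame finds a match, the inner and left joins agree, while the right join adds the unmatched survey rows with a missing location.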
Below is the info table of three data frames to see the information of the rows of the first, the
second, and the merged data frames.
OK, let’s get back to writing the answer using the inner join.
import pandas as pd
import numpy as np
# Keep only the employees that have a matching survey row.
merged = pd.merge(facebook_employees, facebook_hack_survey,
                  left_on='id', right_on='employee_id', how='inner')
3. The question asks for the average popularity by location, so let’s use the groupby()
function with mean() and then reset the index that groupby() creates using the
reset_index() method.
import pandas as pd
import numpy as np
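The grouping step described above can be sketched as follows. This is only an illustration: the small merged frame is made up, and the column names location and popularity are assumptions based on the required output, not confirmed names from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the merged data frame.
# The column names "location" and "popularity" are assumed.
merged = pd.DataFrame({
    "location": ["US", "UK", "US", "UK"],
    "popularity": [4, 6, 8, 10],
})

# Average popularity per location; reset_index() turns the group keys
# back into a regular column.
result = merged.groupby("location")["popularity"].mean().reset_index()
print(result)
```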
Output
Here is the output, the average popularity based on the locations.
By the way, if you want to learn more about Pandas, here are Pandas Interview Questions for
Data Science.
Outer Join in Python
An app has product features to help guide users through a marketing funnel. Each funnel has
steps as a guide to complete the funnel. Meta asks us to find the average percentage of
completion for each feature.
Link to the question: https://platform.stratascratch.com/coding/9792-user-feature-completion
Data
We have two data frames. Our first data frame is facebook_product_features. The data frame
has the following columns.
Solution Approach
1. Load the pandas library.
2. Group by feature_id and user_id, and calculate the maximum step reached.
3. Merge the two data frames on feature_id using the outer join and fill NAs with zero.
4. Calculate the share of completion by dividing step_reached by n_steps and multiplying
by 100 to get the percentage.
5. Group the data frame by feature_id, select the share of completion, calculate the
mean, reset the index, and save the results to a data frame.
Coding
1. Let’s import the pandas library first to manipulate the data.
import pandas as pd
2. Now it is time to find the maximum step by grouping by feature_id and user_id first.
Then select step_reached and apply the max() function. After that, we will reset the
index that the groupby() function creates.
import pandas as pd
# Highest step each user reached for each feature.
max_step = facebook_product_features_realizations.groupby(
    ["feature_id", "user_id"])["step_reached"].max().reset_index()
3. Next, we have to calculate the share of completion. We will divide the step reached by
n_steps and multiply by 100. So we have to select n_steps from the first data frame and
step_reached from the second data frame. We will combine them on feature_id using
the outer join because we need all values from both data frames to do the math. The
non-matching values will be NA, so we will replace these values with zero after merging.
import pandas as pd
max_step = facebook_product_features_realizations.groupby(
    ["feature_id", "user_id"])["step_reached"].max().reset_index()
# Keep all rows from both data frames; unmatched values become 0.
df = pd.merge(facebook_product_features, max_step, how='outer',
              on='feature_id').fillna(0)
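To see what the outer join and fillna(0) do here, consider a minimal sketch on made-up data. The column names follow the article, while the frames features and realizations and all values are hypothetical. Feature 2 has no realizations at all, so it only survives because of the outer join.

```python
import pandas as pd

# Hypothetical data: feature 2 was never started by any user.
features = pd.DataFrame({"feature_id": [0, 1, 2], "n_steps": [5, 4, 3]})
realizations = pd.DataFrame({
    "feature_id": [0, 0, 0, 1],
    "user_id": [10, 10, 11, 10],
    "step_reached": [1, 3, 2, 1],
})

# Highest step per feature and user.
max_step = (
    realizations.groupby(["feature_id", "user_id"])["step_reached"]
    .max()
    .reset_index()
)

# The outer join keeps feature 2; its missing values are replaced with 0.
df = pd.merge(features, max_step, how="outer", on="feature_id").fillna(0)
print(df)
```

Note that fillna(0) replaces every missing value, so the user_id of the unmatched feature also becomes 0. That is harmless for this calculation, but worth keeping in mind if the column is used later.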
4. At this stage, we will calculate the share of completion by dividing the step reached by
the number of steps and multiplying by 100.
import pandas as pd
max_step = facebook_product_features_realizations.groupby(
    ["feature_id", "user_id"])["step_reached"].max().reset_index()
df = pd.merge(facebook_product_features, max_step, how='outer',
              on='feature_id').fillna(0)
df["share_of_completion"] = (df["step_reached"] / df["n_steps"]) * 100
5. Now, we will group the data frame by feature_id, select the share of completion, and
calculate the mean. Then we will assign these results to the column
avg_share_of_completion and reset the index.
import pandas as pd
max_step = facebook_product_features_realizations.groupby(
    ["feature_id", "user_id"])["step_reached"].max().reset_index()
df = pd.merge(facebook_product_features, max_step, how='outer',
              on='feature_id').fillna(0)
df["share_of_completion"] = (df["step_reached"] / df["n_steps"]) * 100
result = df.groupby("feature_id")["share_of_completion"].mean().to_frame(
    "avg_share_of_completion").reset_index()
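The last two steps can be checked by hand on a tiny example. The sketch below assumes a hypothetical, already merged data frame (the values are made up) and applies the same percentage and averaging logic.

```python
import pandas as pd

# Hypothetical merged frame: feature 0 has two users, features 1 and 2 one row each.
df = pd.DataFrame({
    "feature_id": [0, 0, 1, 2],
    "n_steps": [5, 5, 4, 3],
    "step_reached": [5.0, 0.0, 2.0, 0.0],
})

# Percentage of the funnel each row completed.
df["share_of_completion"] = df["step_reached"] / df["n_steps"] * 100

# Average percentage per feature, stored under a descriptive column name.
result = (
    df.groupby("feature_id")["share_of_completion"]
    .mean()
    .to_frame("avg_share_of_completion")
    .reset_index()
)
print(result)
```

Feature 0 averages one complete and one untouched funnel, feature 1 is half completed, and feature 2 stays at zero because nobody started it.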
Output
Here is the expected output.
If you want to enhance your Python skills, here are Python Interview Questions and Answers.
Left Outer Join in Python
Amazon asks us to sort records based on the customer's first name and the order details in
ascending order.
Link to the question: https://platform.stratascratch.com/coding/9891-customer-details
Data
We have two data frames. The first data frame is customers. The data frame has the following
columns.
Coding
1. First, let’s import the pandas and NumPy libraries.
import pandas as pd
import numpy as np
2. Here, we will merge both data frames using the left join, with customers as the left data
frame. The output should contain the records sorted by the customer's first name and the
order details, and we need the list of all customers, including those without an order.
import pandas as pd
import numpy as np
merged = pd.merge(customers, orders, left_on = 'id', right_on = 'cust_id',
                  how = 'left')
3. It is time to sort values according to the first name and order details and select the first
name, last name, city, and order details.
import pandas as pd
import numpy as np
merged = pd.merge(customers, orders, left_on='id', right_on='cust_id',
                  how='left')
# Keep the requested columns and sort by first name, then order details.
result = merged[['first_name', 'last_name', 'city', 'order_details']].sort_values(
    ['first_name', 'order_details'])
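The effect of the left join is easiest to see with a customer who has no orders. The following sketch uses made-up customers and orders (the column names follow the article, the values are hypothetical).

```python
import pandas as pd

# Hypothetical data: Eva (id 2) has not placed any order.
customers = pd.DataFrame({
    "id": [1, 2, 3],
    "first_name": ["Mia", "Eva", "Jill"],
    "last_name": ["Reed", "Lucas", "Moss"],
    "city": ["Miami", "Austin", "Denver"],
})
orders = pd.DataFrame({
    "cust_id": [1, 1, 3],
    "order_details": ["Shoes", "Coat", "Kale"],
})

# The left join keeps every customer, even without a matching order.
merged = pd.merge(customers, orders, left_on="id", right_on="cust_id", how="left")
result = merged[["first_name", "last_name", "city", "order_details"]].sort_values(
    ["first_name", "order_details"]
)
print(result)
```

Eva still appears in the result, with a missing value in order_details. An inner join would have dropped her.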
Output
Here is the expected output.
Right Outer Join in Python
Data
We have two data frames. The orders data frame has the following columns.
The second data frame is customers. The data frame has the following columns.
Also, here is the preview.
Solution Approach
1. First, let’s load the libraries.
2. Merge the two data frames using the right join.
3. Find the customers without an order by using the isnull() method.
4. Find the number of customers without an order by using the len() function.
Coding
1. Let’s import the NumPy and pandas libraries first to manipulate the data and use the
statistical methods with it.
import pandas as pd
import numpy as np
2. Now, we will merge these two data frames from the right to find the number of customers
without an order. After merging two data frames from the right, the customer's order data
column will be null if there’s no order. And we can find the customers who haven’t had
any orders in the next step.
import pandas as pd
import numpy as np
merged = pd.merge(orders, customers, left_on='cust_id', right_on='id',
                  how='right')
3. Here we will use the isnull() function to find the customer ids that don't have any
orders.
import pandas as pd
import numpy as np
merged = pd.merge(orders, customers, left_on='cust_id', right_on='id',
                  how='right')
# Customers without an order have no cust_id coming from the orders frame.
null_cust = merged[merged['cust_id'].isnull()]
4. By using the len() function, we find the number of customers that have not had any
orders.
import pandas as pd
import numpy as np
merged = pd.merge(orders, customers, left_on='cust_id', right_on='id',
                  how='right')
null_cust = merged[merged['cust_id'].isnull()]
result = len(null_cust)
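The same steps can be tried on a small example. The sketch below assumes made-up orders and customers (the column names follow the article, the values are hypothetical), with two customers who never ordered.

```python
import pandas as pd

# Hypothetical data: customers 2 and 4 have no orders.
orders = pd.DataFrame({
    "cust_id": [1, 1, 3],
    "order_details": ["Shoes", "Coat", "Kale"],
})
customers = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "first_name": ["Mia", "Eva", "Jill", "Tom"],
})

# The right join keeps every customer; order columns are null when unmatched.
merged = pd.merge(orders, customers, left_on="cust_id", right_on="id", how="right")
null_cust = merged[merged["cust_id"].isnull()]
result = len(null_cust)
print(result)
```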
Output
Here is the expected output.
If you want to explore joins in SQL too, here are Different Types of SQL JOINs.
Conclusion
In this article, you learned about the four pandas joins in Python through interview questions
from companies like Meta and Amazon. These questions showed you, step by step, how to use
joins in Python and, more specifically, where to use them while doing data manipulation.
Practicing similar interview questions will keep you ready for interviews, so you should turn it
into a habit. Join the StrataScratch community and sign up today so we can help you find your
dream job.