How to avoid duplicate columns after join in PySpark ?

Last Updated : 05 Sep, 2024

When working with PySpark, it's common to join two DataFrames. However, if the DataFrames contain columns with the same name (that aren't used as join keys), the resulting DataFrame can have duplicate columns. This is particularly relevant when performing self-joins or joins on multiple columns. Below, we discuss methods to avoid these duplicate columns. In this article, we will discuss how to avoid duplicate columns in DataFrame after join in PySpark using Python.

Create the first dataframe for demonstration:

Python

# Importing module
import pyspark
from pyspark.sql import SparkSession

# Creating a SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# Data for the first DataFrame
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

# Column names for the first DataFrame
columns = ['ID', 'NAME', 'Company']

# Creating the first DataFrame
dataframe = spark.createDataFrame(data, columns)
dataframe.show()

Output:

Create a second dataframe for demonstration:

Python

# Data for the second DataFrame
data1 = [["1", "45000", "IT"],
         ["2", "145000", "Manager"],
         ["6", "45000", "HR"],
         ["5", "34000", "Sales"]]

# Column names for the second DataFrame
columns = ['ID', 'salary', 'department']

# Creating the second DataFrame
dataframe1 = spark.createDataFrame(data1, columns)
dataframe1.show()

Output:

Method 1: Using drop() function

After performing a join, we can use the drop() function to remove one of the duplicate columns.

dataframe.join(dataframe1, dataframe['ID'] == dataframe1['ID'], 'inner') performs an inner join on the 'ID' column.
.drop(dataframe['ID']) removes the 'ID' column from the first DataFrame, avoiding duplication.

Python

# Joining the DataFrames and dropping the 'ID' column from the first DataFrame
result = dataframe.join(dataframe1, dataframe['ID'] == dataframe1['ID'], 'inner').drop(dataframe['ID'])
result.show()

Output:

Method 2: Using join()

Another approach is to join the DataFrames and then explicitly select the columns you need, excluding duplicates.

dataframe.join(dataframe1, ['ID']) performs a join on the 'ID' column.
.select() allows you to specify the columns to include in the result, thus avoiding duplicate columns.

Python

# Performing the join and selecting relevant columns
result = dataframe.join(dataframe1, ['ID']).select(dataframe['ID'], dataframe['NAME'], dataframe['Company'], dataframe1['salary'], dataframe1['department'])
result.show()

Output:

Conclusion

By choosing our join methods and selecting columns, we can manage and avoid duplicate columns in our DataFrames. These techniques are useful in various scenarios, including self-joins and multi-column joins.