Rename Duplicated Columns after Join in Pyspark dataframe
Last Updated: 26 Apr, 2025
In this article, we are going to learn how to rename duplicated columns after a join in a Pyspark data frame in Python.
A Pyspark data frame is a distributed collection of data grouped into named columns. When handling a lot of data, not all of it comes from a single data frame, so two or more data frames often need to be merged. The merge or join can be inner, outer, left, right, etc., but if some columns are duplicated after the join, we can no longer apply functions to them unambiguously. This article therefore explains how to rename duplicated columns after a join in a Pyspark data frame.
Steps to rename duplicated columns after a join in a Pyspark data frame:
Step 1: First of all, import the required library, i.e., SparkSession. The SparkSession library is used to create the session.
from pyspark.sql import SparkSession
Step 2: Now, create a spark session using the getOrCreate() function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, either read the CSV files for the two data frames or create the two data frames using the createDataFrame() function.
data_frame = spark_session.createDataFrame(
    [(column_1_data), (column_2_data), (column_3_data)],
    ['column_name_1', 'column_name_2', 'column_name_3'])
or
data_frame = spark_session.read.csv('#Path of CSV file',
                                    sep=',', inferSchema=True, header=True)
Step 4: Further, join the two data frames on a common column. We can use any type of join: left, right, inner, outer, etc. Finally, rename the duplicated columns of either data frame using the withColumnRenamed() function, which takes the old column name and the new column name as parameters.
data_frame_1.withColumnRenamed(
    'column_to_be_renamed', 'new_column_name_1').join(
    data_frame_2.withColumnRenamed(
        'column_to_be_renamed', 'new_column_name_2'),
    data_frame_1.column_to_be_joined_1 == data_frame_2.column_to_be_joined_2,
    'inner').show()
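An alternative worth noting: instead of renaming up front, you can alias the two data frames and qualify the clashing column with col() when selecting. The sketch below is a minimal illustration of this approach; the frame names df_left and df_right, the column names, and the data are all made up for this example.
Python3
# A hedged alternative: alias both data frames, join, then
# rename the duplicated 'Fine' columns while selecting
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark_session = SparkSession.builder.getOrCreate()

# Illustrative frames that both contain a 'Fine' column
df_left = spark_session.createDataFrame(
    [(1, 100, 112)], ['Roll_Number', 'Fine', 'Department_Number'])
df_right = spark_session.createDataFrame(
    [(400, 112)], ['Fine', 'Match_Department_Number'])

# Alias the two sides so the duplicated column can be qualified
joined = df_left.alias('l').join(
    df_right.alias('r'),
    col('l.Department_Number') == col('r.Match_Department_Number'),
    'inner')

# Qualified references resolve the duplicates unambiguously
joined.select('Roll_Number', 'Department_Number',
              col('l.Fine').alias('Fine'),
              col('r.Fine').alias('Updated Fine')).show()
Aliasing keeps both source columns available until you pick the final names, which can be handy when many columns clash.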
Example 1:
In this example, we have created two data frames: the first with the fields 'Roll_Number', 'Name', 'Fine', and 'Department_Number', and the second with the fields 'Fees', 'Fine', and 'Match_Department_Number'. Both data frames contain a 'Fine' column, which would be duplicated after the join.
First Data Frame:
+-----------+-------+----+-----------------+
|Roll_Number|   Name|Fine|Department_Number|
+-----------+-------+----+-----------------+
|          1|   Arun| 100|              112|
|          2| Ishita| 200|              123|
|          3|Vinayak| 400|              112|
+-----------+-------+----+-----------------+
Second Data Frame:
+-----+----+-----------------------+
| Fees|Fine|Match_Department_Number|
+-----+----+-----------------------+
|10000| 400|                    112|
|14000| 500|                    123|
|12000| 800|                    136|
+-----+----+-----------------------+
Here, we have joined the two data frames with an inner join on the 'Department_Number' column of the first data frame and the 'Match_Department_Number' column of the second. Since we don't need to know the index of the column to be renamed, we rename the duplicated column of the second data frame by name, using the withColumnRenamed() function with the old column name, 'Fine', and the new column name, 'Updated Fine', as parameters.
Python3
# Rename duplicated columns after join in a Pyspark
# dataframe when you don't know the index of the column

# Import the SparkSession library
from pyspark.sql import SparkSession

# Create a spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Create the first data frame using the createDataFrame() function
data_frame_1 = spark_session.createDataFrame(
    [(1, 'Arun', 100, 112), (2, 'Ishita', 200, 123),
     (3, 'Vinayak', 400, 112)],
    ['Roll_Number', 'Name', 'Fine', 'Department_Number'])

# Create the second data frame using the createDataFrame() function
data_frame_2 = spark_session.createDataFrame(
    [(10000, 400, 112), (14000, 500, 123),
     (12000, 800, 136)],
    ['Fees', 'Fine', 'Match_Department_Number'])

# Rename the duplicated column of the second data frame with
# withColumnRenamed(), then join and display the result
data_frame_1.join(
    data_frame_2.withColumnRenamed('Fine', 'Updated Fine'),
    data_frame_1.Department_Number == data_frame_2.Match_Department_Number,
    'inner').show()
Output:
The joined data frame contains both the first data frame's 'Fine' column and the second data frame's renamed 'Updated Fine' column, so no column name is duplicated.
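If both clashing columns should get new names, the step-4 template applies symmetrically. A hedged variant of Example 1 (reusing the same data_frame_1 and data_frame_2; the names 'Student Fine' and 'Department Fine' are made up for illustration) could look like this:
Python3
# Variant of Example 1: rename 'Fine' on both sides before
# joining, so neither output column keeps the clashing name
data_frame_1.withColumnRenamed('Fine', 'Student Fine').join(
    data_frame_2.withColumnRenamed('Fine', 'Department Fine'),
    data_frame_1.Department_Number == data_frame_2.Match_Department_Number,
    'inner').show()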
Example 2:
In this example, we have created two data frames: the first with the fields 'Roll_Number', 'Class', and 'Subject', and the second with the fields 'Next_Class' and 'Subject'. Both data frames contain a 'Subject' column, which would be duplicated after the join.
First Data Frame:
+-----------+-----+-------+
|Roll_Number|Class|Subject|
+-----------+-----+-------+
|          1|    5|  Maths|
|          2|    6|English|
|          3|    8|Science|
+-----------+-----+-------+
Second Data Frame:
+----------+--------------+
|Next_Class|       Subject|
+----------+--------------+
|         6|       English|
|         7|Social Science|
|         9|      Computer|
+----------+--------------+
Here, we have joined the two data frames with an outer join, matching the 'Class' column of the first data frame plus one against the 'Next_Class' column of the second data frame. Since we don't need to know the index of the column to be renamed, we rename the duplicated column of the first data frame by name, using the withColumnRenamed() function with the old column name, 'Subject', and the new column name, 'Previous Year Subject', as parameters.
Python3
# Rename duplicated columns after join in a Pyspark
# dataframe when you don't know the index of the column

# Import the SparkSession library
from pyspark.sql import SparkSession

# Create a spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Create the first data frame using the createDataFrame() function
data_frame_1 = spark_session.createDataFrame(
    [(1, 5, 'Maths'), (2, 6, 'English'), (3, 8, 'Science')],
    ['Roll_Number', 'Class', 'Subject'])

# Create the second data frame using the createDataFrame() function
data_frame_2 = spark_session.createDataFrame(
    [(6, 'English'), (7, 'Social Science'), (9, 'Computer')],
    ['Next_Class', 'Subject'])

# Rename the duplicated column of the first data frame with
# withColumnRenamed(), then join and display the result
data_frame_1.withColumnRenamed('Subject', 'Previous Year Subject').join(
    data_frame_2,
    data_frame_1.Class + 1 == data_frame_2.Next_Class,
    'outer').show()
Output:
Every row matches in the outer join (each 'Class' value plus one lines up with a 'Next_Class' value), and the renamed 'Previous Year Subject' column is now distinct from the second data frame's 'Subject' column.
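When the two data frames share several column names, listing them by hand gets tedious. One hedged generalization, reusing Example 2's frames, is to compute the overlapping names and rename them in a loop; the '_right' suffix is an arbitrary choice, and a join key shared by name would need to be excluded from the loop:
Python3
# Collect the column names that appear in both data frames
duplicated_columns = set(data_frame_1.columns) & set(data_frame_2.columns)

# Rename each clashing column of the second data frame by
# appending a suffix, then join as in Example 2
renamed_data_frame_2 = data_frame_2
for column in duplicated_columns:
    renamed_data_frame_2 = renamed_data_frame_2.withColumnRenamed(
        column, column + '_right')

data_frame_1.join(renamed_data_frame_2,
                  data_frame_1.Class + 1 == renamed_data_frame_2.Next_Class,
                  'outer').show()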