PySpark - Merge Two DataFrames with Different Columns or Schema
Last Updated :
27 Jan, 2022
In this article, we will discuss how to merge two dataframes with different amounts of columns or schema in PySpark in Python.
Let's consider the first dataframe:
Here we are having 3 columns named id, name, and address for better demonstration purpose.
Python3
# importing module
import pyspark
# import when and lit function
from pyspark.sql.functions import when, lit
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "kakumanu"],
["2", "ojaswi", "hyd"],
["3", "rohith", "delhi"],
["4", "sridevi", "kakumanu"],
["5", "bobby", "guntur"]]
# specify column names
columns = ['ID', 'NAME', 'Address']
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)
# display
dataframe1.show()
Output:

Let's consider the second dataframe
Here we are going to create a dataframe with 2 columns.
Python3
# importing module
import pyspark
# import when and lit function
from pyspark.sql.functions import when, lit
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", 23],
["2", 21],
["3", 32],
]
# specify column names
columns = ['ID', 'Age']
# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)
# display
dataframe2.show()
Output:
We can not merge the data frames because the columns are different, so we have to add the missing columns. Here In first dataframe (dataframe1) , the columns  ['ID', 'NAME', 'Address'] and second dataframe (dataframe2 ) columns are  ['ID','Age'].
Now we have to add the Age column to the first dataframe and NAME and Address in the second dataframe, we can do this by using lit() function. This function is available in pyspark.sql.functions which are used to add a column with a value. Here we are going to add a value with None.
Syntax:
for column in [column for column in dataframe1.columns if column not in dataframe2.columns]:
  dataframe2 = dataframe2.withColumn(column, lit(None))
where,
- dataframe1 is the firstdata frame
- dataframe2 is the second dataframe
Add missing columns to both the dataframes
In both the data frames we are going to add the Age column to the first dataframe and NAME and Address in the second dataframe using the above syntax.
Finally, we are displaying the column names of both data frames.
Python3
# importing module
import pyspark
# import lit function
from pyspark.sql.functions import lit
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "kakumanu"],
["2", "ojaswi", "hyd"],
["3", "rohith", "delhi"],
["4", "sridevi", "kakumanu"],
["5", "bobby", "guntur"]]
# specify column names
columns = ['ID', 'NAME', 'Address']
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)
# list of employee data
data = [["1", 23],
["2", 21],
["3", 32],
]
# specify column names
columns = ['ID', 'Age']
# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)
# add columns in dataframe1 that are missing
# from dataframe2
for column in [column for column in dataframe2.columns
if column not in dataframe1.columns]:
dataframe1 = dataframe1.withColumn(column, lit(None))
# add columns in dataframe2 that are missing
# from dataframe1
for column in [column for column in dataframe1.columns
if column not in dataframe2.columns]:
dataframe2 = dataframe2.withColumn(column, lit(None))
# now see the columns of dataframe1
print(dataframe1.columns)
# now see the columns of dataframe2
print(dataframe2.columns)
Output:
['ID', 'NAME', 'Address', 'Age']
['ID', 'Age', 'NAME', 'Address']
Merging Dataframes
Method 1: Using union()
This will merge the data frames based on the position.
Syntax:Â
dataframe1.union(dataframe2)
Example:
In this example, we are going to merge the two data frames using union() method after adding the required columns to both the data frames. Finally, we are displaying the dataframe that is merged.
Python3
# importing module
import pyspark
# import lit function
from pyspark.sql.functions import lit
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "kakumanu"],
["2", "ojaswi", "hyd"],
["3", "rohith", "delhi"],
["4", "sridevi", "kakumanu"],
["5", "bobby", "guntur"]]
# specify column names
columns = ['ID', 'NAME', 'Address']
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)
# list of employee data
data = [["1", 23],
["2", 21],
["3", 32],
]
# specify column names
columns = ['ID', 'Age']
# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)
# add columns in dataframe1 that are missing
# from dataframe2
for column in [column for column in dataframe2.columns
if column not in dataframe1.columns]:
dataframe1 = dataframe1.withColumn(column, lit(None))
# add columns in dataframe2 that are missing
# from dataframe1
for column in [column for column in dataframe1.columns
if column not in dataframe2.columns]:
dataframe2 = dataframe2.withColumn(column, lit(None))
# perform union
dataframe1.union(dataframe2).show()
Output:

Method 2: Using unionByName()
This will merge the two data frames based on the column name.
Syntax:
dataframe1.unionByName(dataframe2)
Example:
In this example, we are going to merge the two data frames using unionByName() method after adding the required columns to both the dataframes. Finally, we are displaying the dataframe that is merged.
Python3
# importing module
import pyspark
# import lit function
from pyspark.sql.functions import lit
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "kakumanu"],
["2", "ojaswi", "hyd"],
["3", "rohith", "delhi"],
["4", "sridevi", "kakumanu"],
["5", "bobby", "guntur"]]
# specify column names
columns = ['ID', 'NAME', 'Address']
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)
# list of employee data
data = [["1", 23],
["2", 21],
["3", 32],
]
# specify column names
columns = ['ID', 'Age']
# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)
# add columns in dataframe1 that are missing
# from dataframe2
for column in [column for column in dataframe2.columns
if column not in dataframe1.columns]:
dataframe1 = dataframe1.withColumn(column, lit(None))
# add columns in dataframe2 that are missing
# from dataframe1
for column in [column for column in dataframe1.columns
if column not in dataframe2.columns]:
dataframe2 = dataframe2.withColumn(column, lit(None))
# perform unionByName
dataframe1.unionByName(dataframe2).show()
Output:

Method 3: Using unionAll()
Syntax:Â
dataframe1.unionAll(dataframe2)
Example:
In this example, we are going to merge the two dataframes using unionAll() method after adding the required columns to both the dataframes. Finally, we are displaying the dataframe that is merged.
Python3
# importing module
import pyspark
# import lit function
from pyspark.sql.functions import lit
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "kakumanu"],
["2", "ojaswi", "hyd"],
["3", "rohith", "delhi"],
["4", "sridevi", "kakumanu"],
["5", "bobby", "guntur"]]
# specify column names
columns = ['ID', 'NAME', 'Address']
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)
# list of employee data
data = [["1", 23],
["2", 21],
["3", 32],
]
# specify column names
columns = ['ID', 'Age']
# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)
# add columns in dataframe1 that are missing
# from dataframe2
for column in [column for column in dataframe2.columns
if column not in dataframe1.columns]:
dataframe1 = dataframe1.withColumn(column, lit(None))
# add columns in dataframe2 that are missing
# from dataframe1
for column in [column for column in dataframe1.columns
if column not in dataframe2.columns]:
dataframe2 = dataframe2.withColumn(column, lit(None))
# perform unionAll
dataframe1.unionAll(dataframe2).show()
Output:
Similar Reads
Pandas - Merge two dataframes with different columns
Let's understand how to merge two dataframes with different columns. In Pandas, you can merge two DataFrames with different columns using concat(), merge() and join(). Merging Two DataFrames with Different Columns - using concat()concat() method is ideal for combining multiple DataFrames vertically
3 min read
Merge two DataFrames with different amounts of columns in PySpark
In this article, we will discuss how to perform union on two dataframes with different amounts of columns in PySpark in Python. Let's consider the first dataframe Here we are having 3 columns named id, name, and address. Python3 # importing module import pyspark # import when and lit function from p
6 min read
Merge two dataframes with same column names
In this discussion, we will explore the process of Merging two dataframes with the same column names using Pandas. To achieve this, we'll leverage the functionality of pandas.concat(), pandas.join(), and pandas.merge() functions. These methods handle the concatenation operations along a specified ax
3 min read
How to delete columns in PySpark dataframe ?
In this article, we are going to delete columns in Pyspark dataframe. To do this we will be using the drop() function. This function can be used to remove values from the dataframe. Syntax: dataframe.drop('column name') Python code to create student dataframe with three columns: Python3 # importing
2 min read
How to create PySpark dataframe with schema ?
In this article, we will discuss how to create the dataframe with schema using PySpark. In simple words, the schema is the structure of a dataset or dataframe. Functions Used:FunctionDescriptionSparkSessionThe entry point to the Spark SQL.SparkSession.builder()It gives access to Builder API that we
2 min read
Filter PySpark DataFrame Columns with None or Null Values
Many times while working on PySpark SQL dataframe, the dataframes contains many NULL/None values in columns, in many of the cases before performing any of the operations of the dataframe firstly we have to handle the NULL/None values in order to get the desired result or output, we have to filter th
4 min read
How to Iterate over rows and columns in PySpark dataframe
In this article, we will discuss how to iterate rows and columns in PySpark dataframe. Create the dataframe for demonstration: Python3 # importing module import pyspark # importing sparksession from pyspark.sql module from pyspark.sql import SparkSession # creating sparksession and giving an app nam
6 min read
Merge two DataFrames in PySpark
In this article, we will learn how to merge multiple data frames row-wise in PySpark. Outside chaining unions this is the only way to do it for DataFrames. The module used is pyspark : Spark (open-source Big-Data processing engine by Apache) is a cluster computing system. It is faster as compared to
4 min read
Select columns in PySpark dataframe
In this article, we will learn how to select columns in PySpark dataframe. Function used: In PySpark we can select columns using the select() function. The select() function allows us to select single or multiple columns in different formats. Syntax: dataframe_name.select( columns_names ) Note: We
4 min read
PySpark - Select Columns From DataFrame
In this article, we will discuss how to select columns from the pyspark dataframe. To do this we will use the select() function. Syntax: dataframe.select(parameter).show() where, dataframe is the dataframe nameparameter is the column(s) to be selectedshow() function is used to display the selected
2 min read