Append data to an empty dataframe in PySpark
Last Updated :
05 Apr, 2022
In this article, we are going to see how to append data to an empty DataFrame in PySpark in the Python programming language.
Method 1: Make an empty DataFrame and make a union with a non-empty DataFrame with the same schema
This method relies on the union() function, which combines two DataFrames that share the same column schema. Note that union() resolves columns by position, not by name.
Syntax : FirstDataFrame.union(SecondDataFrame)
Returns : A DataFrame containing the rows of both DataFrames.
Example:
In this example, we create a DataFrame with a particular schema and data, create an empty DataFrame with the same schema, and then combine the two DataFrames using the union() function.
Python
# Importing PySpark and the SparkSession,
# DataType functionality
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# Creating a Spark session
spark_session = SparkSession.builder.appName(
    'Spark_Session').getOrCreate()

# Creating an empty RDD to make a
# DataFrame with no data
emp_RDD = spark_session.sparkContext.emptyRDD()

# Defining the schema of the DataFrame
columns1 = StructType([StructField('Name', StringType(), False),
                       StructField('Salary', IntegerType(), False)])

# Creating an empty DataFrame
first_df = spark_session.createDataFrame(data=emp_RDD,
                                         schema=columns1)

# Printing the DataFrame with no data
first_df.show()

# Hardcoded data for the second DataFrame
rows = [['Ajay', 56000], ['Srikanth', 89078],
        ['Reddy', 76890], ['Gursaidutt', 98023]]
columns = ['Name', 'Salary']

# Creating the DataFrame
second_df = spark_session.createDataFrame(rows, columns)

# Printing the non-empty DataFrame
second_df.show()

# Storing the union of first_df and
# second_df in first_df
first_df = first_df.union(second_df)

# Our first DataFrame that was empty,
# now has data
first_df.show()
Output :
+----+------+
|Name|Salary|
+----+------+
+----+------+
+----------+------+
| Name|Salary|
+----------+------+
| Ajay| 56000|
| Srikanth| 89078|
| Reddy| 76890|
|Gursaidutt| 98023|
+----------+------+
+----------+------+
| Name|Salary|
+----------+------+
| Ajay| 56000|
| Srikanth| 89078|
| Reddy| 76890|
|Gursaidutt| 98023|
+----------+------+
Method 2: Add a single row to an empty DataFrame by converting the row into a DataFrame
We can use createDataFrame() to convert a single row, supplied as a Python list, into a DataFrame. The details of createDataFrame() are :
Syntax : CurrentSession.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)
Parameters :
- data : RDD, list, or pandas.DataFrame: The data from which the DataFrame is created.
- schema : str/list , optional: A list of column names, a DDL-formatted string, or a pyspark.sql.types.DataType describing the schema.
- samplingRatio : float, optional: The ratio of rows sampled when inferring the schema.
- verifySchema : bool, optional: Verify the data types of every row against the specified schema. The value is True by default.
Example:
In this example, we create an empty DataFrame with a particular schema, build a single-row DataFrame with the same schema using createDataFrame(), take the union of the two DataFrames with union(), store the result back in the originally empty DataFrame, and call show() to see the change.
Python
# Importing PySpark and the SparkSession,
# DataType functionality
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# Creating a Spark session
spark_session = SparkSession.builder.appName(
    'Spark_Session').getOrCreate()

# Creating an empty RDD to make a DataFrame
# with no data
emp_RDD = spark_session.sparkContext.emptyRDD()

# Defining the schema of the DataFrame
columns = StructType([StructField('Stadium', StringType(), False),
                      StructField('Capacity', IntegerType(), False)])

# Creating an empty DataFrame
df = spark_session.createDataFrame(data=emp_RDD,
                                   schema=columns)

# Printing the DataFrame with no data
df.show()

# Hardcoded row for the second DataFrame
added_row = [['Motera Stadium', 132000]]

# Creating the single-row DataFrame
added_df = spark_session.createDataFrame(added_row, columns)

# Storing the union of df and added_df in df
df = df.union(added_df)

# Our first DataFrame that was empty,
# now has data
df.show()
Output :
+-------+--------+
|Stadium|Capacity|
+-------+--------+
+-------+--------+
+--------------+--------+
| Stadium|Capacity|
+--------------+--------+
|Motera Stadium| 132000|
+--------------+--------+
Method 3: Convert the empty DataFrame into a Pandas DataFrame and use the append() function
We will use toPandas() to convert a PySpark DataFrame to a Pandas DataFrame. Its syntax is :
Syntax : PySparkDataFrame.toPandas()
Returns : The corresponding Pandas DataFrame
We will then use the Pandas append() function. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so this method only runs as written on older pandas versions; pandas.concat() is the current replacement. The syntax of append() is :
Syntax : PandasDataFrame.append(other, ignore_index=False, verify_integrity=False, sort=False)
Parameters :
- other : DataFrame, Series, dict, or a list of these: The data to be appended.
- ignore_index : bool: If True, the index labels of the inputs are ignored and the result is given a fresh 0 to n-1 index.
- sort : bool: Sort the columns if the columns of other and the calling DataFrame are not aligned.
Example:
Here we create an empty DataFrame to which the data will be added. We convert the data to be added into a Spark DataFrame using createDataFrame(), convert both DataFrames to Pandas DataFrames with toPandas(), and use the append() function with ignore_index=True to add the non-empty frame to the empty one, since the result is a new DataFrame. Finally, we convert the resulting Pandas DataFrame back to a Spark DataFrame using createDataFrame().
Python
# Importing PySpark and the SparkSession,
# DataType functionality
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# Creating a Spark session
spark_session = SparkSession.builder.appName(
    'Spark_Session').getOrCreate()

# Creating an empty RDD to make a DataFrame
# with no data
emp_RDD = spark_session.sparkContext.emptyRDD()

# Defining the schema of the DataFrame
columns = StructType([StructField('Stadium', StringType(), False),
                      StructField('Capacity', IntegerType(), False)])

# Creating an empty DataFrame
df = spark_session.createDataFrame(data=emp_RDD,
                                   schema=columns)

# Printing the DataFrame with no data
df.show()

# Hardcoded row for the second DataFrame
added_row = [['Motera Stadium', 132000]]

# Creating the DataFrame whose data
# needs to be added
added_df = spark_session.createDataFrame(added_row,
                                         columns)

# Converting our PySpark DataFrames to
# Pandas DataFrames
pandas_added = added_df.toPandas()
df = df.toPandas()

# Using the append() function to add the data
df = df.append(pandas_added, ignore_index=True)

# Reconverting our DataFrame back
# to a PySpark DataFrame
df = spark_session.createDataFrame(df)

# Printing the resultant DataFrame
df.show()
Output :
+-------+--------+
|Stadium|Capacity|
+-------+--------+
+-------+--------+
+--------------+--------+
| Stadium|Capacity|
+--------------+--------+
|Motera Stadium| 132000|
+--------------+--------+
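Since DataFrame.append() no longer exists in pandas 2.0 and later, the Pandas step of this method can be rewritten with pandas.concat(). A minimal sketch of just that step, using the same hardcoded row as above rather than the Spark DataFrames:

```python
import pandas as pd

# An empty Pandas DataFrame with the same columns as above
df = pd.DataFrame(columns=['Stadium', 'Capacity'])

# The row to be added, as a one-row DataFrame
pandas_added = pd.DataFrame([['Motera Stadium', 132000]],
                            columns=['Stadium', 'Capacity'])

# pd.concat() replaces the removed DataFrame.append();
# ignore_index=True gives the result a fresh 0..n-1 index
df = pd.concat([df, pandas_added], ignore_index=True)
print(df)
```

The resulting Pandas DataFrame can then be handed back to spark_session.createDataFrame() exactly as in the example above.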