How to check if something is an RDD or a DataFrame in PySpark?
Last Updated: 23 Nov, 2022
In this article, we are going to check whether given data is an RDD or a DataFrame, using the isinstance(), type(), and dispatch methods.
Method 1: Using isinstance() method
isinstance() checks whether a particular piece of data is an RDD or a DataFrame and returns a boolean value.
Syntax: isinstance(data, DataFrame) or isinstance(data, RDD)
where
- data is our input data
- DataFrame is the class from the pyspark.sql module
- RDD is the class from the pyspark.rdd module
Example program to check whether our data is a DataFrame or not:
Python3
# importing module
import pyspark
#import DataFrame
from pyspark.sql import DataFrame
# importing sparksession
# from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession
# and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [[1, "sravan", "company 1"],
        [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"],
        [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"],
        [4, "sridevi", "company 1"]]
# specify column names
columns = ['ID', 'NAME', 'Company']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
# check if it is a DataFrame or not
print(isinstance(dataframe, DataFrame))
Output:
True
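As a quick contrast (a minimal sketch that reuses the dataframe variable and the Spark session created above), the same DataFrame fails the RDD check:
Python3
# import the RDD class to test against
from pyspark.rdd import RDD

# the DataFrame created above is not an RDD, so this prints False
print(isinstance(dataframe, RDD))
Output:
False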
Check whether the data is an RDD or not:
We can check this using the same isinstance() method.
Syntax: isinstance(data, RDD)
where
- data is our input data
- RDD is the class from the pyspark.rdd module
Example:
Python3
# import DataFrame
from pyspark.sql import DataFrame
# import RDD
from pyspark.rdd import RDD
# need to import for session creation
from pyspark.sql import SparkSession
# creating the spark session
spark = SparkSession.builder.getOrCreate()
# create an rdd with some data
data = spark.sparkContext.parallelize([("1", "sravan", "vignan", 67, 89),
                                       ("2", "ojaswi", "vvit", 78, 89),
                                       ("3", "rohith", "vvit", 100, 80),
                                       ("4", "sridevi", "vignan", 78, 80),
                                       ("1", "sravan", "vignan", 89, 98),
                                       ("5", "gnanesh", "iit", 94, 98)])
# check whether the data is an RDD or not
print(isinstance(data, RDD))
Output:
True
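A related point, shown here as a small self-contained sketch (the df name and its sample rows are ours, not from the article): a DataFrame is not an RDD, but its underlying RDD is always available through the .rdd attribute.
Python3
from pyspark.sql import SparkSession
from pyspark.rdd import RDD

spark = SparkSession.builder.getOrCreate()

# create a small DataFrame
df = spark.createDataFrame([(1, "sravan"), (2, "ojaswi")], ["ID", "NAME"])

# the DataFrame itself is not an RDD ...
print(isinstance(df, RDD))      # False

# ... but its underlying .rdd attribute is an RDD of Row objects
print(isinstance(df.rdd, RDD))  # True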
Convert the RDD into a DataFrame and check the type
Here we will create an RDD, convert it to a DataFrame using the toDF() method, and check both objects.
Python3
# import DataFrame
from pyspark.sql import DataFrame
# import RDD
from pyspark.rdd import RDD
# need to import for session creation
from pyspark.sql import SparkSession
# creating the spark session
spark = SparkSession.builder.getOrCreate()
# create an rdd with some data
rdd = spark.sparkContext.parallelize([(1, "Sravan", "vignan", 98),
                                      (2, "bobby", "bsc", 87)])
# check if it is an RDD
print(" RDD : ", isinstance(rdd, RDD))
# check if it is a DataFrame
print("Dataframe : ", isinstance(rdd, DataFrame))
# display data of rdd
print("Rdd Data : \n", rdd.collect())
# convert rdd to dataframe
data = rdd.toDF()
# check if the converted object is an RDD
print("RDD : ", isinstance(data, RDD))
# check if the converted object is a DataFrame
print("Dataframe : ", isinstance(data, DataFrame))
# display dataframe data
print("Dataframe Data : \n", data.collect())
Output:
 RDD :  True
Dataframe :  False
Rdd Data : 
 [(1, 'Sravan', 'vignan', 98), (2, 'bobby', 'bsc', 87)]
RDD :  False
Dataframe :  True
Dataframe Data : 
 [Row(_1=1, _2='Sravan', _3='vignan', _4=98), Row(_1=2, _2='bobby', _3='bsc', _4=87)]
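Putting the isinstance() checks together, here is a minimal sketch of a reusable helper (the data_kind name is our own, not part of the PySpark API) that labels an object as an RDD, a DataFrame, or something else; it reuses the rdd and data objects from the example above:
Python3
from pyspark.rdd import RDD
from pyspark.sql import DataFrame

def data_kind(obj):
    # classify a PySpark object using isinstance()
    if isinstance(obj, RDD):
        return "RDD"
    if isinstance(obj, DataFrame):
        return "DataFrame"
    return "Unknown"

# reusing the rdd and data objects created above
print(data_kind(rdd))   # RDD
print(data_kind(data))  # DataFrame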
Method 2: Using the type() function
The type() function returns the type of the given object.
Syntax: type(data_object)
Here, data_object is the RDD or DataFrame object.
Example 1: Python program to create data with RDD and check the type
Python3
# need to import for session creation
from pyspark.sql import SparkSession
# creating the spark session
spark = SparkSession.builder.getOrCreate()
# create an rdd with some data
rdd = spark.sparkContext.parallelize([(1, "Sravan", "vignan", 98),
                                      (2, "bobby", "bsc", 87)])
# check the type using type() command
print(type(rdd))
Output:
<class 'pyspark.rdd.RDD'>
Example 2: Python program to create dataframe and check the type.
Python3
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [[1, "sravan", "company 1"],
        [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"],
        [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"],
        [4, "sridevi", "company 1"]]
# specify column names
columns = ['ID', 'NAME', 'Company']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
# check the type of the
# data with the type() function
print(type(dataframe))
Output:
<class 'pyspark.sql.dataframe.DataFrame'>
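One caveat, illustrated with a short self-contained sketch: type() returns the exact class, while isinstance() also accepts subclasses. A transformed RDD is typically a PipelinedRDD, a subclass of RDD, so the two checks can look different even though the object is still an RDD.
Python3
from pyspark.sql import SparkSession
from pyspark.rdd import RDD

spark = SparkSession.builder.getOrCreate()

# a transformation such as map() usually returns a PipelinedRDD
mapped = spark.sparkContext.parallelize([1, 2, 3]).map(lambda x: x * 2)

# exact class, typically <class 'pyspark.rdd.PipelinedRDD'>
print(type(mapped))

# isinstance() still reports True, because PipelinedRDD subclasses RDD
print(isinstance(mapped, RDD))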
Method 3: Using Dispatch
The singledispatch decorator creates a dispatcher object named after the decorated function; we can then register type-specific implementations on that object and call it to perform the check. Here we register one implementation for RDD and one for DataFrame, so the dispatcher tells us which of the two a given object is.
Example 1: Python code to create a single dispatcher, pass the data, and check whether it is an RDD or not
Python3
# importing module
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# import singledispatch
from functools import singledispatch
# import spark context
from pyspark import SparkContext
# create an object for spark
# context with local and name is GFG
sc = SparkContext("local", "GFG")
# creating sparksession
# and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# create a function to dispatch our function
@singledispatch
def check(x):
    pass

# this function returns "RDD"
# if the given input is an RDD
@check.register(RDD)
def _(arg):
    return "RDD"

# this function returns "DataFrame"
# if the given input is a DataFrame
@check.register(DataFrame)
def _(arg):
    return "DataFrame"
# create a pyspark RDD and
# check whether it is an RDD or not
print(check(sc.parallelize([("1", "sravan", "vignan", 67, 89)])))
Output:
RDD
Example 2: Python code to check whether the data is a DataFrame or not
Python3
# importing module
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# import singledispatch
from functools import singledispatch
# import spark context
from pyspark import SparkContext
# create an object for spark
# context with local and name is GFG
sc = SparkContext("local", "GFG")
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# create a function to dispatch our function
@singledispatch
def check(x):
    pass

# this function returns "RDD"
# if the given input is an RDD
@check.register(RDD)
def _(arg):
    return "RDD"

# this function returns "DataFrame"
# if the given input is a DataFrame
@check.register(DataFrame)
def _(arg):
    return "DataFrame"
# create a pyspark dataframe and
# check whether it is a DataFrame or not
print(check(spark.createDataFrame([("1", "sravan",
                                    "vignan", 67, 89)])))
Output:
DataFrame
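As an optional refinement (our own suggestion, not part of the original examples), the base check function can return a label instead of pass, so objects that are neither an RDD nor a DataFrame also get a sensible answer:
Python3
from functools import singledispatch
from pyspark.rdd import RDD
from pyspark.sql import DataFrame

@singledispatch
def check(x):
    # fallback for anything that is neither an RDD nor a DataFrame
    return "Neither RDD nor DataFrame"

@check.register(RDD)
def _(arg):
    return "RDD"

@check.register(DataFrame)
def _(arg):
    return "DataFrame"

# a plain Python list hits the fallback
print(check([1, 2, 3]))
Output:
Neither RDD nor DataFrame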